bensyverson
an hour ago
The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]
Some people will be happy to pay that premium for privacy, but at roughly 10X the cost of a MacBook Neo, that money could also buy a lot of credits on OpenRouter or frontier labs.
[0]: https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space...
dofm
an hour ago
The maths there is pretty undeniable, but it is not where I'd make the split. Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.
I don't know how much serious hands-free agentic coding I will ever do on my MacBook alone, but I do know that I would not have got so far into understanding this without tinkering with local models, llama.cpp, LM Studio, and LM Studio and all that.
I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled. Because it's just huge, exhausting, jargon-drenched, unknowable, and I am over the hill at fifty-plus.
Until, that is, I could poke around with setting it up on my own (secondhand) machine, watching the API calls, understanding some of the terminology. I didn't even buy the machine for that; it's just adequate to the task.
The Neo is too small to really get much benefit from this opportunity to make it more visceral and knowable.
pizza234
13 minutes ago
> Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.
Cloud models are (much) faster, they don't consume so much power/generate heat, they have much bigger (LLM) context, they're much more precise and they have a much wider (engineering) context of the given problem.
Except privacy and use cases that are blocked by cloud models (e.g. reverse engineering), local LLMs are currently an expensive toy.
When I try to program with a local LLM (I'm on a 32/128 GB system), I end up wasting time compared to a cloud LLM.
ddalex
36 minutes ago
I just got Claude to download and install all the models and servers and agents and prepare all the launch scripts for me... no need to learn, just ask it to do it for you
dofm
16 minutes ago
Right, but I am a middle-aged bloke who is experiencing existential angst about whether I can carry on in this industry.
I have a pretty deep, maybe paranoid need to be confident I have an intrinsic understanding, and I have found in my life that lessons come to you when you make yourself open to learning.
So I need to build on top of what I know, taking as much of the hard way as I can bear to take at any one time — it has to be not quite difficult enough to put me off.
I can't really explain what I have learned this way that is different, but I feel it in a way that I wouldn't if I'd simply pushed a button.
sorokod
7 minutes ago
Then what is the point of ddalex?
rusk
an hour ago
> I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled.
I found LM studio to be a nice starting point. Frindlier and more featureful than Ollama and not as intimidating as llama.cpp (though you will want to use that eventually)
dofm
13 minutes ago
LM Studio is also nice because of the way the interface explains things; parameters have explanations and hints. It has been designed by people who really care about making it understandable.
I tried Ollama but I've settled on Unsloth Studio generally; once things really settle down I'll just run the llama-server UI, which is pretty nice.
A friend is tinkering with LLMs for amusement on a 16GB Raspberry Pi 5, and when I explained that llama.cpp now had a typical web chat interface he was so happy — it's amazing what the "table stakes" are now.
cyanydeez
24 minutes ago
I've setup to local paradigms for local coding:
- opencode with it's webui
- deer-flow with it's research/powered front end
They both run websites so you don't have to baby sit them (eg, keep your mac open). I've build a pdf compressor over a few days by first having deer flow try and research the frameworks and pipeline. It stalls out because its not really a fluid programmer. Once it stalls out, I transferred it (manually for now) to opencode and it's refactoring it because it's just a collective bundle of sticks and it needs a lot of testing to tweak out the limited scop context. LLMs can't really hold large scopes (locally anyway, from what I've read from HN, it's possible with longer context).
It'll complete in a few days with maybe 3-4 hours of full attention interaction, but it's running 3x that without my attention. Obviously, if I paid more attention it'd run quicker, but since it's local, it's not pumping out large volumes of code, it's mostly looping over tests and capabilities as observed.
It's running Qwen3.6 35B MoE on a AMD 128GB strix halo. If I switched to the dense models, perhaps it'd be smarter, but the trade off seems to be much slower gen.
dofm
20 minutes ago
> - opencode with it's webui
Have you tried Paseo?
I have opencode in a VM, and the paseo daemon running in the VM, and then the Paseo Mac app. Really nice.
(You can also use the Opencode GUI to frame a remote opencode web interface)
montebicyclelo
3 minutes ago
Isn't the directionality important. I.e. it is currently possible to run useful / great models locally, but on high end machines; and in a few years we will likely be able to run even better models on standard machines.
porphyra
an hour ago
You can also run Qwen 3.6 27B dense model on DGX Spark with comparable performance [1][2] for about $4000 (Asus Ascent GX10 is $3999 at various retailers).
In theory you can also get 48GB of VRAM with, say, two 3090s, but it will take up a lot of space and generate a lot of heat compared to the Macbook Pro and GB10.
esperent
an hour ago
> 48GB of VRAM with, say, two 3090s
So like... $2000+ just for the used GPUs? Plus I assume it's considerably more effort to get it working.
fluoridation
25 minutes ago
>Plus I assume it's considerably more effort to get it working.
Nah, not really. It is a little annoying in terms of space and power, though. Not every case and motherboard can support cards that big.
Catloafdev
an hour ago
The model they reference can be easily run with 24gb+ of VRAM, and there are other similar models capable of running easily on 16gb of VRAM. It's not like 128gb is a requirement here.
bitexploder
6 minutes ago
For a MBP I have 48 GB of RAM M5 Pro. It runs at about 12-14 t/s at Q4, you could probably optimize it further. RAM is not a limitation but overall memory bandwidth. Q8 is slower. 35B A3B Qwen is quite speedy, but a little less accurate. With Qwen 3.6 27B dense I can squeeze a 9B parameter model and use that for fast analysis or code scanning while 27B is churning on a task in the background. It is tight, but totally reasonable.
The real sweet spot for Qwen 27B is getting it on something like a Dual 3090 system or some other config where it can blaze at 50-80 t/s and that costs well under 6K currently. It is a surprisingly capable model. Using something like GLM for orchestration, specs, task farming and then letting Qwen churn is relatively inexpensive.
Overall I recommend people try models of this class out using OpenCode and some for pay service to experiment with them and understand how they work. I find they are very useful.
Long term, I am convinced enough that if I wanted to use local models for any number of reasons I would be okay investing in a dual GPU box. The Mac is not fast enough for me and M5 Max is just too expensive relative to GPU linux box. Still, it is nice to have the models local ON the laptop and it is useful for what I care about locally.
thewebguyd
an hour ago
I'd go for at least 32GB+. It'll fit in 24GB but leaves you little to no room for context, and that's at 4-bit quantization.
If you want to run unquantized, you definitely need 128GB.
bitexploder
4 minutes ago
It also comes down to inference speed, not "can I run this". 8-bit quant is quite a bit slower on an M5 Pro.
Catloafdev
an hour ago
Nobody runs unquantized, there's literally no reason to. Q8 would be the largest anyone actually runs on consumer hardware for inference.
nok22kon
19 minutes ago
a computer with 24 GB VRAM is at least $3000
sleepyeldrazi
2 minutes ago
I can't speak for the US, but in Germany (where hardware is usually more expensive, not less), I got my 3090 3 months ago for 750 euro and have been running the iq4_nl 27B using q4 kv (which after recent patches in llama.cpp is in my xp indistinguishably accurate from q8 of f16) at full ctx, with MTP at 2, peaking around 70 t/s on small ctx, around 50 t/s when im around 64k and ends around 40 t/s near the cap. The rest of the PC is a 50 euro ddr3 16gb i5 4th gen box, absolutely nothing special. And this setup is often more useful than dsv4pro (and sometimes kimi, but not glm) for research and ML work.
throw1234567891
11 minutes ago
But the tokens or credits are gone. MacBook stays. You can run other models on the same MacBook. What I read people burn every month on saas… for that money you break even on that MacBook in 5 months.
Edit: it’s not just “data privacy”, when you are using Claude, you are shipping EVERYTHING to Anthropic. It’s crazy.
nozzlegear
an hour ago
Just putting it out there: I run Qwen 3.6 on my M1 Mac Studio with 64gb. It's quantized and all that, but I agree with TFA: it's the sweet spot for local development right now.
dmayle
23 minutes ago
For that price you can put together a PC with 128GB of ram ($2000) and an RTX 5090 ($3600) and get 70-100 tokens per second instead of 45
trentor
2 minutes ago
Runs fine on 2x4080s or on two 5060/5070s with 16GBVRAM... and faster than on the mac.
dannyw
an hour ago
I’m running the same model on a 48GB MBP with a q4 quant and it’s pretty decent. You definitely don’t 128GB. That’s the scale for 70B models at q8 or something.
dom96
21 minutes ago
I've been running it on my 48GB MBP too and it's not particularly great. Super slow and not near enough to the quality provided by even Claude Sonnet.
doodlesdev
40 minutes ago
How much does one of those cost in the US? Here in Brazil, your notebook is worth as much as a used Honda Fit, which seems absolutely insane. For comparison, the ThinkPad I'm currently running cost me 1/20 of how much this MBP costs here, leaving me with over $8.000 to spend with LLM inference (if I actually spent money with that).
dannyw
30 minutes ago
I purchased mine for approximately $4400 AUD before the price hikes. That unit is now ~$5100 AUD.
I use my MBP essentially as my workstation, it's almost always plugged in. I have a MBA (M4, 24GB RAM) that I picked up for ~A$1500 or so, and that's an amazing daily driver. I don't do local LLM inference on that unit, I can just hit my own APIs (via LM Studio) on the MBP over Tailscale.
organsnyder
an hour ago
I run Qwen 3.6 on my Framework Desktop 128GB, and it's very performant. I know Framework has had to raise the price since I preordered mine, but they're still well under half the cost of that Macbook.
stymaar
37 minutes ago
> The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]
Qwen3.6-27B would be faster on a 3090 that costs around $1000-1200 though so I don't think it's a good counter-argument.
Op just happened to have that MacBook, but it doesn't mean it's necessary to run the model.
boutell
13 minutes ago
That 3090 is going to burn 750W and it will still cap you at a 4 bit quant and ~48K context. Here's someone who worked through it:
https://github.com/noonghunna/qwen36-27b-single-3090
Flies though (50-70tps is impressive for a model this smart)
I went through roughly the same process to get it working on my M2 Macbook Pro... at awful speeds of course, since models like this one are mostly bound by memory bandwidth.
dvduval
42 minutes ago
Absolutely for the average developer the token speed is just going to be too slow for it to be workable. I think we’re looking at 2028 when memory becomes cheaper again and they’ll be a lot more people using local models.
Insanity
an hour ago
But you have to factor in that this device will last you 5-10 years. That said, I wouldn't spend almost $7k USD on this macbook lol.
petilon
an hour ago
Memory requirements of newer models will increase, so while the hardware may last 10 years it won't be able to run the latest models for 10 years.
roadside_picnic
an hour ago
My experience working in the open model space pretty deeply (both LLMs and diffusion models) for years now is that it is not quite as simple as that.
In the open model space an insane amount of effort goes into getting more powerful models to run with the same or less RAM. For example in the diffusion world many things that could not be run on easily under 24GB of VRAM actually run much better today with much less VRAM than they did a few years ago. You can do many things today with 8-16GB of VRAM that would not have been possible. At the same time the most advanced open models, like LTX 2.3 for video gen, still seem to respect 24GB of VRAM as the upper bound.
Similarly the standard "big" but localish open model for LLMs back in the day was Llama 3 70B, this was both a much worse and much larger model than Qwen 3.6 27B
So in two different spaces I've witnessed the "RAM required to run the best" decreasing or at least remaining stable, while the performance being achieved in both areas is astounding (LTX 2.3 is faster, better and more capable than the Wan 2.2 model that held popularity before it).
The biggest thing to watch out for is not just RAM/VRAM but memory bandwidth. You can try to "future proof" yourself with lots of RAM, but if it's 400 GB/S you're still constrained to smaller models.
Insanity
an hour ago
You raise a fair point, but I'm not convinced it'll offer a meaningful difference in performance as long as we're stuck with the current AI paradigm.
bluGill
an hour ago
Will they? Or will we find ways to optimize models and need less? Only time will tell.
cyanydeez
21 minutes ago
I think you have too much faith in context AGI.
at 128GB, you can find almost it's entire context for Qwen3.6 35B MoE.
Again, I think you have too much faith in extrapolation. It's like you got a baby at 0 months, then measured it at 12 months and expect it to be a giant.
simonw
an hour ago
It can't run the latest models today - GLM-5.2 class models already need 1TB+ of RAM.
... but, the models that WILL run on 128GB (or 64GB or even 32GB) models today are a huge improvement on the best models that would run in the same amount of memory six months ago.
someperson
an hour ago
In 5-10 years, incremental cloud tokens will be far cheaper (likely but not guaranteed).
georgeven
an hour ago
I have a 1500 dollar machine that can run it at 50 tok/s (3 V100s)
Dig1t
21 minutes ago
How did you buy 3 V100's for $1500??
colinsane
7 minutes ago
i like that people are taking the privacy argument seriously, after however many decades. i think there are other arguments to be made for running these locally which are less settled, but IMO the Fable debacle drives it home: the surest way to embrace this technology without worry that it will be taken away from you down the road is to physically own the compute.
cyanydeez
30 minutes ago
AMD started their 128GB Halo Strix at a pretty damn good point at ~2.5k; I got mine after the first memory bump at $3k.
I think you might be a little to into the stew here.
zdragnar
a minute ago
I got mine at the same price point, and I've been pretty pleased with it. Tailscale lets me use it from my ultrabook / lightweight laptop, no burning lap or crazy fan noises. Desktops with the amd ai+ 395 are still fairly affordable for what they can do.
I haven't tried it with https://lemonade-server.ai/ yet but I just might give it a shot.
oldfuture
an hour ago
a lot of credits? we can’t predict any price change for them
AnimalMuppet
an hour ago
How many credits would it buy? How long would it take to use them up? What's the payback period?
From what I understand, for a developer, $5000/month is maybe the high end, but $5000/year is fairly standard. (Is that accurate?) So if it pays back in 15 months, that's pretty decent. If it pays back in two months, that's spectacular.
eli
30 minutes ago
Are you comparing the cost of hosted Opus to running Qwen 3.6 locally? That doesn't really seem fair.
h4ny
an hour ago
What kind of narrative are you trying to push?
Do you know how much VRAM/unified is needed for the 27B model, which is generally regarded as better between the two compared in the article, is needed with little to no KLD loss and at 256k context?
Also, once you worked out how much memory is needed for that, maybe tell us how much a non-Apple system that you can run that (probably similarly or faster) would cost?
And when you have answered that, can you tell us how much privacy costs? Maybe also tell us how private OpenRouter is?
Edit: looking at other replies that are basically pointing out the same thing I did, I guess it's my wording. It's frustrating that people who misinform others in some nicely packaged ways or just simply uninformed get to keep doing that if they sound nice. Thanks.
kllrnohj
an hour ago
> maybe tell us how much a non-Apple system that you can run that (probably similarly or faster) would cost?
Ryzen AI Max 395+ with 128GB of unified memory can be found around $3-4k.
But 27B isn't that large, either, especially if you are ok with the quantized models. So this laptop choice seems to more be a "because they had it" rather than "this is what's necessary for this particular workflow"
h4ny
an hour ago
That's my point. You can run Qwen3.6 27B with MTP and whatever else you want to bolt onto it at 256k context for much less than even a Ryzen AI Max 395+ with 128GB would cost. Even unquantized you don't need 128 GB so given your comment and the downvotes maybe I didn't word my original comment properly for this?