hackernews client

A guide to local coding models

607 pointsposted 2 months ago

362 Comments

simonw

2 months ago

> I realized I looked at this more from the angle of a hobbiest paying for these coding tools. Someone doing little side projects—not someone in a production setting. I did this because I see a lot of people signing up for $100/mo or $200/mo coding subscriptions for personal projects when they likely don’t need to.

Are people really doing that?

If that's you, know that you can get a LONG way on the $20/month plans from OpenAI and Anthropic. The OpenAI one in particular is a great deal, because Codex is charged a whole lot lower than Claude.

The time to cough up $100 or $200/month is when you've exhausted your $20/month quota and you are frustrated at getting cut off. At that point you should be able to make a responsible decision by yourself.

kristopolous

2 months ago

I use local models + openrouter free ones.

My monthly spend on ai models is < $1

I'm not cheap, just ahead of the curve. With the collapse in inference cost, everything will be this eventually

I'll basically do

    $ man tool | <how do I do this with the tool>

or even

    $ cat source | <find the flags and give me some documentation on how to use this>

Things I used to do intensively I now do lazily.

I've even made a IEITYuan/Yuan-embedding-2.0-en database of my manpages with chroma and then I can just ask my local documentation how I do something conceptually, get the man pages, inject them into local qwen context window using my mansnip llm preprocessor, forward the prompt and then get usable real results.

In practice it's this:

    $ what-man "some obscure question about nfs" 
    ...chug chug chug (about 5 seconds)...

    <answer with citations back to the doc pages>

Essentially I'm not asking the models to think, just do NLP and process text. They can do that really reliably.

It helps combat a frequent tendency for documentation authors to bury the most common and useful flags deep in the documentation and lead with those that were most challenging or interesting to program instead.

I understand the inclination it's just not all that helpful for me

nl

2 months ago

This is a completely different thing to AI coding models.

If you aren't using coding models you aren't ahead of the curve.

There are free coding models. I use them heavily. They are ok but only partial substitutes for frontier models.

kristopolous

2 months ago

I'm extremely familiar with them.

Some people, with some tasks, get great results

But me, with my tasks, I need to maintain provenance and accountability over the code. I can't just have AI fly by the seat of its pants.

I can get into lots of detail on this. If you have seen tools and setups I have done you'd realize why it doesn't work for me.

I've spent money, the results for me, with my tasks, have not been the right decision.

user

2 months ago

[deleted]

aquafox

2 months ago

> I'll basically do

    $ man tool | <how do I do this with the tool>

or even $ cat source | <find the flags and give me some documentation on how to use this>

Could you please elaborate on this? Do I get this right that you can set up your your command line so that you can pipe something to a command that sends this something together with a question to an LLM? Or did you just mean that metaphorically? Sorry if this is a stupid question.

mr_mitm

2 months ago

Yes, I use simonw's `llm` for that: https://github.com/simonw/llm

Example:

    $ man tar | llm "how do I extract test.txt from a tar.gz"

scottyeager

2 months ago

I'm not the OP, but I did build a tool that I use in the same way: https://github.com/scottyeager/Pal

Actually for many cases the LLM already knows enough. For more obscure cases, piping in a --help output is also sometimes enough.

__m

2 months ago

i guess op means: $ man tool | ai <how do I do this with the tool>

where ai could be a simple shell script combining the argument with stdin

m4ck_

2 months ago

Is your RAG manpages thing on github somewhere? I was thinking about doing something like that (it's high on my to-do list but I haven't actually done anything with llms yet.)

kristopolous

2 months ago

I'll get it up soon, probably should. This little snippet will help you though:

   $ man --html="$(which markitdown)" <man page>

That goes man -> html -> markdown which is not only token efficient but also llms are pretty good at creating hierarchies from markdown

r-w

2 months ago

I bet you could do the same thing with pandoc and skip serializing to HTML entirely.

2 months ago

> My monthly spend on ai models is < $1

> I'm not cheap

You're cheap. It's okay. We're all developers here. It's a safe space.

mathgeek

2 months ago

While I say this somewhat in jest, frugal is just cheap but with better value.

MuffinFlavored

2 months ago

> I'm not cheap, just ahead of the curve.

I'm not convinced.

I'm convinced you don't value your time. As Simon said, throw $20-$100/mo and get the best state of the art models with "near 0" setup and move on.

martin1975

2 months ago

this is the extent to what I use any LLM - they're really good at looking up just about anything, in natural language, and most of the time even the first hit, without reprompting, is a pretty decent answer. I used to have to sort thru things to get there, so there's definitely an upside to LLMs in this manner.

techwizrd

2 months ago

Have you looked at tldr/tealdeer[0]? It may do much of what you're looking for, albeit without LLM assistance.

0: https://tealdeer-rs.github.io/tealdeer/

Aurornis

2 months ago

The limits for the $20/month plan can be reached in 10-20 minutes when having it explore large codebases with directed. It’s also easy to blow right through the quota if you’re not managing content well (waiting until it fills up and then auto-compacting, or even using /compact frequently instead of /clear or the equivalent in different tools).

For most of my work I only need the LLM to perform a structured search of the codebase or to refactor something faster than I can type, so the $20/month plan is fine for me.

But for someone trying to get the LLM to write code for them, I could see the $20/month plans being exhausted very quickly. My experience with trying “vibecoding” style app development, even with highly detailed design documents and even providing test case expected output, has felt like lighting tokens on fire at a phenomenal rate. If I don’t interrupt every couple of commands and point out some mistake or wrong direction it can spin seemingly for hours trying to deal with one little problem after another. This is less obvious when doing something basic like a simple React app, but becomes extremely obvious once you deviate from material that’s represented a lot in training materials.

sheepscreek

2 months ago

Not for Codex. Not even for Gemini/Antigravity! I am truly shocked by how much mileage I can get out of them. I recently bought the $200/mo OpenAI subscription but could barely use 10% of it. Now for over a month, I use codex for at least 2 hrs every day and have yet to reach the quota.

With Gemini/Antigravity, there’s the added benefit of switching to Claude Code Opus 4.5 once you hit your Gemini quota, and Google is waaaay more generous than Claude. I can use Opus alone for the entire coding session. It is bonkers.

So having subscribed to all three at their lowest subscriptions (for $60/mo) I get the best of each one and never run out of quota. I’ve also got a couple of open-source model subscriptions but I’ve barely had the chance to use them since Codex and Gemini got so good (and generous).

The fact that OpenAI is only spending 30% of their revenue on servers and inference despite being so generous is just mind boggling to me. I think the good times are likely going to last.

My advise - get Gemini + Codex lowest tier subscriptions. Add some credits to your codex subscription in case you hit the quota and can’t wait. You’ll never be spending over $100 even if you’re building complex apps like me.

Aurornis

2 months ago

2 months ago

I like Pro also for better access to 5.2 Pro which is indispensable for some problems and for producing specs/code samples. I use https://gitingest.com

selcuka

2 months ago

Not the same poster, but apparently they tried the $200/mo subscription, but after seeing they don't need it, they "subscribed to all three at their lowest subscriptions (for $60/mo)" instead.

Aurornis

2 months ago

> but apparently they tried the $200/mo subscription, but after seeing they don't need it

This is why it’s confusing, though. Why start with the highest plan as the starting point when it’s so easy to upgrade?

2 months ago

I do the same and agree this works well.

It's worth noting that the Claude subscription seems notably less than the others.

Also there are good free options for code review.

sellmesoap

2 months ago

My first try at LLM coding was with Claude, got back confusing results for a hello world++ type test and ran out of credits in a couple of hours, asked for a refund all the same day. I'm slowly teaching myself prompt engineering on qwen3-coder, it goes in circles much like claude was, but at least it's doing that at the cost of electricity at the wall, I already had a GPU.

jjromeo

2 months ago

Can confirm this is the way right now

JamesSwift

2 months ago

That has not been my experience with sonnet, and even so it is largely remedied by having better AI docs caching the results of that investigation for future use.

stuaxo

2 months ago

You'd think local models could explore a codename and build up a knowledge graph of it they could use to query it.

It could take longer, but save your subscription tokens.

user

2 months ago

[deleted]

uneekname

2 months ago

Yes, we are doing that. These tools help make my personal projects come to life, and the money is well worth it. I can hit Claude Code limits within an hour, and there's no way I'm giving OpenAI my money.

_delirium

2 months ago

As a third option, I've found I can do a few hours a day on the $20/mo Google plan. I don't think Gemini is quite as good as Claude for my uses, but it's good enough and you get a lot of tokens for your $20. Make sure to enable the Gemini 3 preview in gemini-cli though (not enabled by default).

deaux

2 months ago

Huge caveat: For the $20/mo subscription Google hasn't made clear if they train on your data. Anthropic and OAI on the other hand either clearly state they don't train on paid usage or offer very straightforward opt-outs.

https://geminicli.com/docs/faq/

> What is the privacy policy for using Gemini Code Assist or Gemini CLI if I’ve subscribed to Google AI Pro or Ultra?

> To learn more about your privacy policy and terms of service governed by your subscription, visit Gemini Code Assist: Terms of Service and Privacy Policies.

> https://developers.google.com/gemini-code-assist/resources/p...

The last page only links to generic Google policies. If they didn't train on it, they could've easily said so, which they've done in other cases - e.g. for Google Studio and CLI they clearly say "If you use a billed API key we don't train, else we train". Yet for the Pro and Ultra subscriptions they don't say anything.

This also tracks with the fact that they enormously cripple the Gemini app if you turn off "apps activity" even for paying users.

If any Googlers read this, and you don't train on paying Pro/Ultra, you need to state this clearly somewhere as you've done with other products. Until then the assumption should be that you do train on it.

versteegen

2 months ago

I have no idea at all whether the GCP "Service Specific Terms" [1] apply to Gemini CLI, but they do apply to Gemini used via Github Copilot [2] (the $10/mo plan is good value for money and definitely doesn't use your data for training), and states:

  Service Terms
  17. Training Restriction. Google will not use Customer Data to train or fine-tune any AI/ML models without Customer's prior permission or instruction.

[1] https://cloud.google.com/terms/service-terms

[2] https://docs.github.com/en/copilot/reference/ai-models/model...

ayewo

2 months ago

Thanks for those links. GitHub Copilot looks like a good deal at $10/mo for a range of models.

I originally thought they only supported the previous generation models i.e. Claude Opus 4.1 and Gemini 2.5 Pro based on the copy on their pricing page [1] but clicking through [2] shows that they support far more models.

[1] https://github.com/features/copilot#pricing

[2] https://github.com/features/copilot/plans#compare

versteegen

2 months ago

Yes, it's a great deal especially because you get access to such a wide range of models, including some free ones, and they only rate limit for a couple minutes at a time, not 5 hours. And if you go over the monthly limit you can just buy more at $0.04 a request instead of needing to switch to a higher plan. The big downside is the 128k context windows.

Lately Copilot have been getting access to new frontier models the same day they release elsewhere. That wasn't the case months ago (GPT 5.1). But annoyingly you have to explicitly enable each new model.

deaux

2 months ago

Yeah Github of course has proper enterprise agreements with all the models they offer and they include a no-training clause. The $10/mo plan is probably the best value for money out there currently along with Codex $20/mo (if you can live with GPT's speed).

lostmsu

2 months ago

Are you sure about OpenAI? I thought they actually do retain your agent chats (training I am less concerned about personally).

Anthropic has an option to opt out of training and delete the chats from their cloud in 30 days.

deaux

2 months ago

I was only talking about training so you're probably right about retention - I care more about training.

_delirium

2 months ago

That's good to know, thanks. In my case nearly 100% of my code ends up public on GitHub, so I assume everyone's code models are training on it anyway. But would be worth considering if I had proprietary codebases.

w23j

2 months ago

That's the main reason, why I hope Google does not win this AI war.

someguyiguess

a month ago

My thoughts exactly. The $100 Claude subscription is the sweet spot for me. I signed up for the $20 at first and got irritated constantly hitting access limits. Then I bought the $200 subscription but never even hit 1/4 of my allocation. So the $100 would be perfect.

And this is for hobby / portfolio projects.

wyre

2 months ago

calenti

2 months ago

Well you did hire some(thing)...for $100/month.

satvikpendem

2 months ago

I'm talking about the general trend, not the exceptions. How much of the code do you manually write with the 100 dollar subscription? Vibe coding is a descriptive, not a prescriptive, label.

cmrdporcupine

2 months ago

Claude's $20 plan should be renamed to "trial". Try Opus and you will reach your limit in 10 minutes. With Sonnet, if you aren't clearing the context very often, you'll hit it within a few hours. I'm sympathetic to developers who are using this as their only AI subscription because while I was working on a challenging bug yesterday I reached the limit before it had even diagnosed the problem and had to switch to another coding agent to take over. I understand you can't expect much from a $20 subscription, but the next jump up costing $80 is demotivating.

kxrm

2 months ago

Incidentally, wondering if anyone has seen this approach of asking Claude to manage Codex:

https://www.reddit.com/r/codex/comments/1pbqt0v/using_codex_...

joshribakoff

2 months ago

To me, it doesn’t matter how cheap open AI codex is because that tool just burns up tokens, trying to switch to the wrong version of node using NVM on my machine. It spirals in a loop and never makes progress, for me, no matter how explicitly or verbosely i prompt.

On the other hand, Claude has been nothing but productive for me.

I’m also confused why you don’t assume people have the intelligence to only upgrade when needed. Isn’t that what we’re all doing? Why would you assume people would immediately sign up for the most expensive plan that they don’t need? I already assumed everyone starts on the lowest plan and quickly runs into session limits and then upgrades.

Also coaching people on which paid plan to sign up for kinda has nothing to do with running a local model, which is what this article is about

nineteen999

2 months ago

I spent about 45 mins trying to get both Claude and ChatGPT to help get Codex running on my machine (WSL2) and on a Linux NUC, they couldn't help me get it working so I gave up and went back to Claude.

c-hendricks

2 months ago

Why is an LLM trying to switch node versions?

2 months ago

From my point of view, you're either choosing between instruction following or more creative solutions.

Codex models tend to be extremely good at following instructions, to the point that it won't do any additional work unless you ask it to. GPT-5.1 and GPT-5.2 on the other hand is a little bit more creative.

Models from Anthropics on the other hand is a lot more loosy goosy on the instructions, and you need to keep an eye on it much more often.

I'm using models interchangeably from both providers all the time depending on the task at hand. No real preference if one is better then the other, they're just specialized on different things

baq

2 months ago

bit the bullet this week and paid for a month of claude and a month of chatgpt plus. claude seems to have much lower token limits, both aggregate and rate-limited and GPT-5.2 isn't a bad model at all. $20 for claude is not enough even for a hobby project (after one day!), openai looks like it might be.

InsideOutSanta

2 months ago

I feel like a lot of the criticism the GPT-5.x models receive only applies to specific use cases. I prefer these models over Anthropic's because they are less creative and less likely to take freedoms interpreting my prompts.

Sonnet 4.5 is great for vibe coding. You can give it a relatively vague prompt and it will take the initiative to interpret it in a reasonable way. This is good for non-programmers who just want to give the model a vague idea and end up with a working, sensible product.

But I usually do not want that, I do not want the model to take liberties and be creative. I want the model to do precisely what I tell it and nothing more. In my experience, te GPT-5.x models are a better fit for that way of working.

2 months ago

For sure. On one project I kept using codex just to see where the wall was. Took a long time.

deaux

2 months ago

> Are people really doing that?

Sure am. Capacity to finish personal projects has tripled for a mere $200/month. Would purchase again.

haritha-j

2 months ago

I’ve been using vs code copilot pro for a few months and never really had any issue, once you hit the limit for one model, you generally still have a bunch more models to choose from. Unless I was vibe coding massive amounts of code without looking to testing, it’s hard to imagine I will run out of all the available pro models.

deaux

2 months ago

Copilot Pro works with a total requests budget rather than per-model limits unless something changed. Could you explain?

a month ago

Maybe for very light work. But on the $20 subscription level I’d hit access limits every 3-4 hours.

shepherdjerred

2 months ago

I pay $200/mo just for Claude Code. I used Cursor for a while and used something like $600 in credits in Nov.

RickyLahey

2 months ago

2 months ago

I'm curious what the mental calculus was that a $5k laptop would competitively benchmark against SOTA models for the next 5 years was.

Somewhat comically, the author seems to have made it about 2 days. Out of 1,825. I think the real story is the folly of fixating your eyes on shiny new hardware and searching for justifications. I'm too ashamed to admit how many times I've done that dance...

Local models are purely for fun, hobby, and extreme privacy paranoia. If you really want privacy beyond a ToS guarantee, just lease a server (I know they can still be spying on that, but it's a threshold.)

ekjhgkejhgk

2 months ago

I agree with everything you said, and yet I cannot help but respect a person who wants to do it himself. It reminds me of the hacker culture of the 80s and 90s.

slicktux

2 months ago

Agreed, Everyone seems to shun the DIY hacker now a days; saying things like “I’ll just pay for it”. It’s not about just NOT paying for it but doing it yourself and learning how to do it so that you can pass the knowledge on and someone else can do it.

davidw

2 months ago

I loathe the idea of being beholden to large corporations for what may be a key part of this job in the future.

2 months ago

My 2023 Macbook Pro (M2 Max) is coming up to 3 years old and I can run models locally that are arguably "better" than what was considered SOTA about 1.5 years ago. This is of course not an exact comparison but it's close enough to give some perspective.

menaerus

2 months ago

[1] https://mistral.ai/news/devstral-2-vibe-cli

[2] https://www.anthropic.com/news/swe-bench-sonnet

cmrdporcupine

2 months ago

Thing is you can pay basically fractions of cents a query to e.g. DeepSeek Platform or DeepInfra or Z.Ai or whatever and have them run the same open models for far cheaper and faster than you could ever build out at home.

It's neat to play with, but not practical.

The only story that I can see that makes sense for running at home is if you're going to fine tune a model by taking an open weight model and <hand waving> doing things to it and running that. Even then I believe there's places (hugging face?) that will host and run your updated model for cheaper than you could run it yourself.

Aurornis

2 months ago

> Even the small Devstral-2 model (24b) seems to easily beat Sonnet 3.5 [2].

I've played with Devstral 2 a lot since it came out. I've seen the benchmarks. I just don't believe it's actually better for coding.

It's amazing that it can do some light coding locally. I think it's great that we have that. But if I had to choose between a 2024-era model and Devstral 2 I'd pick the older Sonnet or GPTs any day.

menaerus

2 months ago

> 40k in consumer hardware is never going to compete with 40k of AI specialized GPUs/servers.

For general purpose LLM probably yes. For something very domain-specialized not necessarily.

cmrdporcupine

2 months ago

With RAM prices spiking, there's no way consumers are going to have access to frontier quality models on local hardware any time soon, simply because they won't fit.

That's not the same as discounting the open weight models though. I use DeepSeek 3.2 heavily, and was impressed by the Devstral launch recently. (I tried Kimi K2 and was less impressed). I don't use them for coding so much as for other purposes... but the key thing about them is that they're cheap on API providers. I put $15 into my deepseek platform account two months ago, use it all the time, and still have $8 left.

I think the open weight models are 8 months behind the frontier models, and that's awesome. Especially when you consider you can fine tune them for a given problem domain...

satvikpendem

2 months ago

> I'm curious what the mental calculus was that a $5k laptop would competitively benchmark against SOTA models for the next 5 years was.

Well, the hardware remains the same but local models get better and more efficient, so I don't think there is much difference between paying 5k for online models over 5 years vs getting a laptop (and well, you'll need a laptop anyway, so why not just get a good enough one to run local models in the first place?).

Workaccount2

2 months ago

Even if intelligence scaling stays equal, you'll lose out on speed. A sota model pumping 200 tk/s is going to be impossible to ignore with a 4 year old laptop choking itself at 3 tk/s.

Even still, right now is when the first gen of pure LLM focused design chipsets are getting into data centers.

lelanthran

2 months ago

> Even if intelligence scaling stays equal, you'll lose out on speed. A sota model pumping 200 tk/s is going to be impossible to ignore with a 4 year old laptop choking itself at 3 tk/s.

Unless you're YOLOing it, you can review only at a certain speed, and for a certain number of hours a day.

The only tokens/s you need is one that can keep you busy, and I expect that even a slow 5token/sec model utilised 60s in every minute, 60m of every hour and 24 hours of every day is way more than you can review in a single working day.

The goal we should be moving towards is longer-running tasks, not quicker responses, because if I can schedule 30 tasks to my local LLm before bed, then wake up in the morning and schedule a different 30, and only then start reviewing, then I will spend the whole day just reviewing while the LLM is generating code for tomorrow's review. And for this workflow a local model running 5 tokens/s is sufficient.

2 months ago

30-years old calculators are still good enough for basic arithmetic and in fact even in 2025 people have one emulated on their phone that isn't more powerful than the original, and people still use them routinely.

If Claude 3 Sonnet was good enough to be your daily driver last year, surely something that is as powerful is good enough to be your daily driver today. It's not like the amount of work you must do to get paid doubled over the past year or anything.

Some people just feel the need to live always on the edge for no particular reason.

ashirviskas

2 months ago

Claude 3 Sonnet was good enough for many things, but not as universal as 4.5 Opus. It is immeasureably more useful to me.

I agree with all of your other points though.

yoan9224

2 months ago

The cost analysis here is solid, but it misses the latency and context window trade-offs that matter in practice. I've been running Qwen2.5-Coder locally for the past month and the real bottleneck isn't cost - it's the iteration speed. Claude's 200k context window with instant responses lets me paste entire codebases and get architectural advice. Local models with 32k context force me to be more surgical about what I include.

That said, the privacy argument is compelling for commercial projects. Running inference locally means no training data concerns, no rate limits during critical debugging sessions, and no dependency on external API uptime. We're building Prysm (analytics SaaS) and considered local models for our AI features, but the accuracy gap on complex multi-step reasoning was too large. We ended up with a hybrid: GPT-4o-mini for simple queries, GPT-4 for analysis, and potentially local models for PII-sensitive data processing.

The TCO calculation should also factor in GPU depreciation and electricity costs. A 4090 pulling 450W at $0.15/kWh for 8 hours/day is ~$200/year just in power, plus ~$1600 amortized over 3 years. That's $733/year before you even start inferencing. You need to be spending $61+/month on Claude to break even, and that's assuming local performance is equivalent.

estimator7292

2 months ago

I'd only consider the GPU cost if you intend to chuck it in a dumpster after three years. Why not factor in the cost of your CPU and amortize your RAM and disks?

Those aren't useful numbers.

raw_anon_1111

2 months ago

I don’t think I’ve ever read an article where the reason I knew the author was completely wrong about all of their assumptions was that they admitted it themselves and left the bad assumptions in the article.

The above paragraph is meant to be a compliment.

But justifying it based on keeping his Mac for five years is crazy. At the rate things are moving, coding models are going to get so much better in a year, the gap is going to widen.

2 months ago

This story talks about MLX and Ollama but doesn't mention LM Studio - https://lmstudio.ai/

LM Studio can run both MLX and GGUF models but does so from an Ollama style (but more full-featured) macOS GUI. They also have a very actively maintained model catalog at https://lmstudio.ai/models

ZeroCool2u

2 months ago

LMStudio is so much better than Ollama it's silly it's not more popular.

thehamkercat

2 months ago

LMStudio is not open source though, ollama is

but people should use llama.cpp instead

smcleod

2 months ago

I suspect Ollama is at least partly moving away open source as they look to raise capitol, when they released their replacement desktop app they did so as closed source. You're absolutely right that people should be using llama.cpp - not only is it truly open source but it's significantly faster, has better model support, many more features, better maintained and the development community is far more active.

calgoo

2 months ago

Only issue I have found with llama.cpp is trying to get it working with my amd GPU. Ollama almost works out of the box, in docker and directly on my Linux box.

Lapel2742

2 months ago

>Only issue I have found with llama.cpp is trying to get it working with my amd GPU.

I had no problems with ROCm 6.x but couldn't get it to run with ROCm 7.x. I switched to Vulkan and the performance seems ok for my use cases

parthsareen

2 months ago

Desktop app is open-source now.

nateb2022

2 months ago

> but people should use llama.cpp instead

MLX is a lot more performant than Ollama and llama.cpp on Apple Silicon, comparing both peak memory usage + tok/s output.

edit: LM Studio benefits from MLX optimizations when running MLX compatible models.

behnamoh

2 months ago

2 months ago

ik_llama is almost always faster when tuned. However, when untuned I've found them to be very similar in performance with varied results as to which will perform better.

But vLLM and Sglang tend to be faster than both of those.

Abishek_Muthian

2 months ago

Besides optimizations specific to running locally lands in lamma.cpp first.

midius

2 months ago

Makes me think it's a sponsored post.

Cadwhisker

2 months ago

LMStudio? No, it's the easiest way to run am LLM locally that I've seen to the point where I've stopped looking at other alternatives.

It's cross-platform (Win/Mac/Linux), detects the most appropriate GPU in your system and tells you whether the model you want to download will run within it's RAM footprint.

It lets you set up a local server that you can access through API calls as if you were remotely connected to an online service.

vunderba

2 months ago

FWIW, Ollama already does most of this:

- Cross-platform

- Sets up a local API server

The tradeoff is a somewhat higher learning curve, since you need to manually browse the model library and choose the model/quantization that best fit your workflow and hardware. OTOH, it's also open-source unlike LMStudio which is proprietary.

2 months ago

Yes, I wanted to try it already but setting up an environment with an MI50 was a bit tricky so I wanted to try something I knew first. Now that I have ollama running I will give llama.cpp a shot.

ashirviskas

2 months ago

Ooh, I have experience with it. If you're on linux, just use Vulkan. If you face any other issues, just google my username + "MI50 32GB vbios reddit". It depends on which vBIOS you have, but that post on reddit has most of the info you may need. Good luck!

thehamkercat

2 months ago

I think you should mention that LM Studio isn't open source.

I mean, what's the point of using local models if you can't trust the app itself?

rubymamis

2 months ago

You can always use something like Little Snitch to not allow it to dial home.

behnamoh

2 months ago

2 months ago

I am still hoping, but for the moment… I have been trying every 30-80B model that came out in the last several months, with crush and opencode, and it's just useless. They do produce some output, but it's nowhere near the level that claude code gets me out of the box. It's not even the same league.

With LLMs, I feel like price isn't the main factor: my time is valuable, and a tool that doesn't improve the way I work is just a toy.

That said, I do have hope, as the small models are getting better.

larodi

2 months ago

Claude Code is a lot about prompting and orchestration of the conversation. The LLM is just a tool in these agentic frameworks. Whats truly ingenious is how context is engineered/managed, how is the code-RAG approached, and them LLM memory that is used.

So my guess would be - we need open conversation or something along the line of "useful linguistic-AI approaches for combing and grooming code"

jwr

2 months ago

Agreed. I've been trying to use opencode and crush, and none of them do anything useful for me. In contrast, claude code "just works" and does genuinely useful work. And it's not just because of the specific LLM used, it's the overall engineering of the tool, the prompt behind the scenes, etc.

But the bottom line is that I still can't find a way to use either local LLMs and/or opencode and crush for coding.

sbene970

2 months ago

Search for "Claude Code Router" on GitHub, which you can use to route any models through Claude Code.

larodi

2 months ago

Which is very sad and perhaps she should be aiming to introduce some very smart linguists into the whole ML:LLM thing that can learn and explore how to best to interact with the funny archive that models are.

DrAwdeOccarim

2 months ago

I use Opus 4.5 and GPT 5.2-Codex through VS Code all day long, and the closest I've come is Devstral-Small-2-24B-Instruct-2512 inferring on a DGX Spark hosting with vLLM as an "Open AI Compatible" API endpoint I use to power the Cline VS Code extension.

It works, but it's slow. Much more like set it up and come back in an hour and it's done. I am incredibly impressed by it. There are quantized GGUFs and MLXs of the 123B, which can fit on my M3 36GB Macbook that I haven't tried yet.

But overall, it feels about about 50% too slow, which blows my mind because we are probably 9 months away from a local model that is fast and good enough for my script kiddie work.

lostmsu

2 months ago

I did the same with recent stuff and so far gpt-oss-120b on high was the best with gpt-oss-20b on high close second.

NelsonMinar

2 months ago

"This particular [80B] model is what I’m using with 128GB of RAM". The author then goes on to breezily suggest you try the 4B model instead of you only have 8GB of RAM. With no discussion of exactly what a hit in quality you'll be taking doing that.

ethmarks

2 months ago

This is like if an article titled "A guide to growing your own food instead of buying produce" explained that the author was using a four-acre plot of farmland but suggested that that reader could also use a potted plant instead. Absolutely baffling.

bjt12345

2 months ago

Here's my take on it though...

Just as we had the golden era of the internet in the late 90s, when the WWW was an eden of certificate-less homepages with spinning skulls on geocities without ad tracking, we are now in the golden era of agentic coding where massive companies make eye watering losses so we can use models without any concerns.

But this won't last and Local Llamas will become a compelling idea to use, particularly when there will be a big second hand market of GPUs from liquidated companies.

aleggg

2 months ago

Yes. This heavily subsidized LLM inference usage will not last forever.

We have already seen cost cutting for some models. A model starts strong, but over time the parent company switches to heavily quantized versions to save on compute costs.

Companies are bleeding money, and eventually this will need to adjust, even for a behemoth like Google.

That is why running local models is important.

sesm

2 months ago

Unfortunately, GPUs die in datacenters very quickly, and GPU manufacturers don't care about hardware longevity.

yread

2 months ago

Yep, when the tide goes away no company will be able to keep swimming naked offering stuff for free

cloudhead

2 months ago

In my experience the latest models (Opus 4.5, GPT 5.2) Are _just_ starting to keep up with the problems I'm throwing at them, and I really wish they did a better job, so I think we're still 1-2 years away from local models not wasting developer time outside of CRUD web apps.

OptionOfT

2 months ago

This older HN thread shows R1 running on a ~$2k box using ~512 GB of system RAM, no GPU, at ~3.5-4.25 TPS: https://news.ycombinator.com/item?id=42897205

If you scale that setup and add a couple of used RTX 3090s with heavy memory offloading, you can technically run something in the K2 class.

nl

2 months ago

Is 4 TPS actually useful for anything?

That's around 350,000 tokens in a day. I don't track my Claude/Codex usage, but Kilocode with the free Grok model does and I'm using between 3.3M and 50M tokens in a day (plus additional usage in Claude + Codex + Mistral Vibe + Amp Coder)

I'm trying to imagine a use case where I'd want this. Maybe running some small coding task overnight? But it just doesn't seem very useful.

2 months ago

I appreciate the author's modesty but the flip-flopping was a little confusing. If I'm not mistaken, the conclusion is that by "self-hosting" you save money in all cases, but you cripple performance in scenarios where you need to squeeze out the kind of quality that requires hardware that's impractical to cobble together at home or within a laptop.

I am still toying with the notion of assembling an LLM tower with a few old GPUs but I don't use LLMs enough at the moment to justify it.

a_victorp

2 months ago

If you ever do it, please make a guide! I've been toying with the same notion myself

suprjami

2 months ago

If you want to do it cheap, get a desktop motherboard with two PCIe slots and two GPUs.

Cheap tier is dual 3060 12G. Runs 24B Q6 and 32B Q4 at 16 tok/sec. The limitation is VRAM for large context. 1000 lines of code is ~20k tokens. 32k tokens is is ~10G VRAM.

Expensive tier is dual 3090 or 4090 or 5090. You'd be able to run 32B Q8 with large context, or a 70B Q6.

For software, llama.cpp and llama-swap. GGUF models from HuggingFace. It just works.

If you need more than that, you're into enterprise hardware with 4+ PCIe slots which costs as much as a car and the power consumption of a small country. You're better to just pay for Claude Code.

le-mark

2 months ago

I was going to post snark such as “you could use the same hardware to also lose money mining crypto” then realized there are a lot of crypto miners out their that could probably make more money running tokens then they do on crypto. Does such a market place exist?

hackstack

2 months ago

This is essentially vast.ai, no?

MrDrMcCoy

2 months ago

A quick glance at their homepage says they run in "secure datacenters", so no.

gkbrk

2 months ago

Indeed, his self hosting inspired me to get Qwen3:32B ollama working locally. Fits nicely on my M1 pro 32GB (running Asahi). Output is a nice read-along speed and I havent felt the need for anything more powerful.

I'd be more tempted with a maxed out M2 Ultra as an upgrade, vs tower with dedicated GPU cards. The unified memory just feels right for this task. Although I noticed the 2nd hand value of those machine jumped massively in the last few months.

I know that people turn their noses up at local LLM's, but it more than does the job for me. Plus I decided a New Years Resolution of no more subscriptions / Big-AdTech freebies.

amarant

2 months ago

Buying a maxed out MacBook Pro seems like the most expensive way to go about getting the necessary compute. Apple is notorious for overcharging for hardware, especially on ram.

I bet you could build a stationary tower for half the price with comparable hardware specs. And unless I'm missing something you should be able to run these things on Linux.

Getting a maxed out non-apple laptop will also be cheaper for comparable hardware, if portability is important to you.

kube-system

2 months ago

You need memory hooked up to the GPU. Apple’s unified memory is actually one of the cheaper ways to do this. On a typical x86-64 desktop, this means VRAM… for 100+ GB of VRAM you’re deep into tens of thousand of dollars.

Also, if you think Apple’s RAM prices are crazy… you might be surprised at what current DDR5 pricing is today. The $800 that Apple charges to upgrade a MBP from 64-128GB is the current price of 64GB desktop DDR5-6000. Which is actually slower memory than the 8533 MT/s memory you’re getting in the MacBook.

nl

2 months ago

You want unified RAM.

On Linux your options are the NVidia Spark (and other vendor versions) or the AMD Ryzen AI series.

These are good options, but there are significant trade-offs. I don't think there are Ryzen AI laptops with 128GB RAM for example, and they are pricey compared to traditional PCs.

You also have limited upgradeability anyway - the RAM is soldered.

Renaud

2 months ago

Can any x86 based system actually comes with that much unified memory?

Not an Apple fanboy, but I was under the impression that having access to up to 512GB usable GPU memory was the main feature in favour of the mac.

And now with Exo, you can even break the 512GB barrier.

embedding-shape

2 months ago

> because GPT-OSS frequently gave me “I cannot fulfill this request” responses when I asked it to build features.

This is something that frequently comes up and whenever I ask people to share the full prompts, I'm never able to reproduce this locally. I'm running GPT-OSS-120B with the "native" weights in MXFP4, and I've only seen "I cannot fulfill this request" when I actually expect it, not even once had that happen for a "normal" request you expect to have a proper response for.

Has anyone else come across this when not using the lower quantizations or 20b (So GPT-OSS-120B proper in MXFP4) and could share the exact developer/system/user prompt that they used that triggered this issue?

Just like at launch, from my point of view, this seems to be a myth that keeps propagating, and no one can demonstrate a innocent prompt that actually triggers this issue on the weights OpenAI themselves published. But then the author here seems to actually have hit that issue but again, no examples of actual prompts, so still impossible to reproduce this issue.

maranas

2 months ago

Cline + RooCode and VSCode already works really well with local models like qwen3-coder or even the latest gpt-oss. It is not as plug-and-play as Claude but it gets you to a point where you only have to do the last 5% of the work

rynn

2 months ago

What are you working on that you’ve had such great success with gpt-oss?

I didn’t try it long because I got frustrated waiting for it to spit out wrong answers.

But I’m open to trying again.

maranas

2 months ago

I use it to build some side-projects, mostly apps for mobile devices. It is really good with Swift for some reason.

I also use it to start off MVP projects that involve both frontend and API development but you have to be super verbose, unlike when using Claude. The context window is also small, so you need to know how to break it up in parts that you can put together on your own

embedding-shape

2 months ago

> What are you working on that you’ve had such great success with gpt-oss?

I'm doing programming on/off (mostly use Codex with hosted models) with GPT-OSS-120B, and with reasoning_effort set to high, it gets it right maybe 95% of the times, rarely does it get anything wrong.

ineedasername

2 months ago

I’ve been using Qwen3 Coder 30b quantized down to IQ3_XSS to fit in < 16gb vram. Blazing fast 200+ tokens per second on a 4080. I don’t ask anything complicated, but one off scripts to do something I’d normally have to do manually by hand or take an hour to write the script myself? Absolutely.

These are no more than a few dozen lines I can easily eyeball and verify with confidence- that’s done in under 60 seconds and leaves Claude code with plenty of quota for significant tasks.

throw-12-16

2 months ago

I never see devs containerize their coding agents.

It seems so obvious to me, but I guess people are happy with claude living in their home directory and slurping up secrets.

onion2k

2 months ago

The devs I work with don't put secrets in their home directories. ;)

rester324

2 months ago

How do you know? Do you snoop on their work machines?

littlestymaar

2 months ago

And where are all their software putting their data then? Unless you consider only private keys to be secrets…

(In particular the fact that Claude Code has access to your Anthropic API key is ironic given that Dario and Anthropic spend a lot of time fearmongering about how the AI could go rogue and “attempt to escape”).

throw-12-16

2 months ago

many many tools default to this, claude included

BoredPositron

2 months ago

Not worth it yet. I run a 6000 black for image and video generation, but local coding models just aren't on the same level as the closed ones.

jszymborski

2 months ago

2 months ago

I simply ask Claude Sonnet, using claudecode, to use opencode. That's it! Example:

  We need to clean up code lint and format errors across multiple files. Check which files are affected using cargo commands. Please use opencode, a coding agent that is installed. Use `opencode run <prompt>` to pass in a per-file prompt to opencode, wait for it to finish, check and ask again if needed, then move to next file. Do not work on files yourself.

baconner

2 months ago

There are a couple of decent approaches to having a planning/reviewer model set (eg. claude, codex, gemini) and an execution model (eg. glm 4.6, flash models, etc) workflow that I've tried. All three of these will let you live in a single coding cli but swap in different models for different tasks easily.

- claude code router - basically allows you to swap in other models using the real claude code cli and set up some triggers for when to use which one (eg. plan mode use real claude, non plan or with keywords use glm)

- opencode - this is what im mostly using now. similar to ccr but i find it a lot more reliable against alt models. thinking tasks go to claude, gemini, codex and lesser execution tasks go to glm 4.6 (on ceberas).

- sub-agent mcp - Another cool way is to use an mcp (or a skill or custom /command) that runs another agent cli for certain tasks. The mcp approach is neat because then your thinker agent like claude can decide when to call the execution agents, when to call in another smart model for a review of it's own thinking, etc instead of it being explicit choice from you. So you end up with the mcp + an AGENTS.md that instructs it to aggressively use the sub-agent mcp when it's a basic execution task, review, ...

I also find that with this setup just being able to tap in an alt model when one is stuck, or get review from an alt model can help keep things unstuck and moving.

KronisLV

2 months ago

lelanthran

2 months ago

> It will be like the rest of computing, some things will move to the edge and others stay on the cloud.

It will become like cloud computing - some people will have a cloud bill of $10k/m to host their apps, other people would run their app on a $15/m VPS.

Yes, the cost discrepancy will be as big as the current one we see in cloud services.

Terr_

2 months ago

I think the long term will depends on the legal/rent-seeking side.

Imagine having the hardware capacity to run things locally, but not the necessary compliance infrastructure to ensure that you aren't committing a felony under the Copyright Technofeudalism Act of 2030.

mungoman2

2 months ago

The money argument is IMHO not super strong, here as that Mac depreciates more per month than the subscription they want to avoid.

There may be other reasons to go local, but I would say that the proposed way is not cost effective.

There's also a fairly large risk that this HW may be sufficient now, but will be too small in not too long. So there is a large financial risk built into this approach.

The article proposes using smaller/less capable models locally. But this argument also applies to online tools! If we use less capable tools even the $20/mo subscriptions won't hit their limit.

altx

2 months ago

Its interesting to notice that here https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com... we default to measuring LLM coding performance as how long[~5h] a human task a model can complete with 50% success-rate (with 80% fall back for the second chart [~.5h]), while here it seems that for actual coding we really care about the last 90-100% of the costly model's performance.

ljosifov

2 months ago

Nah - given the ergonomics + economics, local coding models are not atm that viable. I like all things local even if just for safety of keeping healthy competitive ecosystem. And I can imagine really specialised uses cases where I run an 8B not-so-smart model to process oodles of data on my local 7900xtx or similar. Got older m2 mbp with 96gb (v)ram and try all things local that fit. Usually LMStudio for the speed add in MLX format models on ASI (as end point; plus chat for vibes test; LMStudio omission from the OP blog post makes me question the post), or llama.cpp for GGUF (llama.cpp is the OG; excellent and universal engine and format; recently got even better). Looking at how agents work - an agent smarts of Claude Code or Codex in using the tools feels like it's half its success (the other half the underlying LLM smarts). From the training on baked in 'Tool Use & Interleaved Thinking' on the right tools in a right way, to the trivial 'DONOTDO bad idea to fill your 100K useful context with random content of multi-MB file as prompt'. The $20/mo plans are insanely competitive. OpenaI is generous with Codex, and in addition to terminal that I mostly use, there is the VSCode addon as well as use in Cline or Roo. Cursor offers in-house model fast and good, insane economy reading large codebases, as well BYOK to latest-greatest LLMs afaik. Claude Code $20/mo is stingy with quotas, but can be supplement with Z.ai standing in - glm-4.7 as of yesterday (saw no difference glm-4.6 v.v. sonnet-4.5 already v.good). It's a 3 lines change to ~/.claude/settings.json to flip Z.ai-Anthropic back and forth at will (e.g. when paused on one to switch to the other). Have not tried the Cerebras high tok/s but wd love to - not waiting makes a ton of difference to productivity.

SpaceManNabs

2 months ago

I love that this article added a correction and took ownership in it. This encourages more people to blog stuff and then get more input for parts they missed.

The best way to get the correct answer on something is posting the wrong thing. Not sure where I got this from, but I remember it was in the context of stackoverflow questions getting the correct answer in the comments of a reply :)

Props to the author for their honesty and having the impetus to blog about this in the first place.

bilater

2 months ago

If you are using local models for coding you are midwiting this. Your code should be worth more than a subscription.

The only legit use case for local models is privacy.

Aurornis

2 months ago

> I'd be very excited if Jane Street or DE Shaw were running their trading models through Claude. Then I'd have access to billions of dollars of secrets.

Using Claude for inference does not mean the codebase gets pulled into their training set.

This is a tired myth that muddies up every conversation about LLMs

jgalt212

2 months ago

mungoman2

2 months ago

The money argument doesn't make sense here as that Mac depreciates more per month than the subscription they want to avoid.

There may be other reasons to go local, but the proposed way is not cost effective.

Ultimatt

2 months ago

For local MLX inference LM Studio is a much nicer option than Ollama

Bukhmanizer

2 months ago

Are people really so naive to think that the price/quality of proprietary models is going to stay the same forever? I would guess sometime in the next 2-3 years all of the major AI companies are going to increase the price/enshittify their models to the point where running local models is really going to be worth it.

stuaxo

2 months ago

Is the conclusion the same if you have a computer that is just for the LLM, and a separate one that runs your dev tools ?

tempodox

2 months ago

> You might need to install Node Package Manager for this.

How anyone in this day and age can still recommend this is beyond me.

KronisLV

2 months ago

My experience: even for the run of the mill stuff, local models are often insufficient, and where they would be sufficient, there is a lack of viable software.

For example, simple tasks CAN be handled by Devstral 24B or Qwen3 30B A3B, but often they fail at tool use (especially quantized versions) and you often find yourself wanting something bigger, where the speed falls a bunch. Even something like zAI GLM 4.6 (through Cerebras, as an example of a bigger cloud model) is not good enough for doing certain kinds of refactoring or writing certain kinds of scripts.

So either you use local smaller models that are hit or miss, or you need a LOT of expensive hardware locally, or you just pay for Claude Code, or OpenAI Codex, or Google Gemini, or something like that. Even Cerebras Code that gives me a lot of tokens per day isn't enough for all tasks, so you most likely will need a mix - but running stuff locally can sometimes decrease the costs.

For autocomplete, the one thing where local models would be a nearly perfect fit, there just isn't good software: Continue.dev autocomplete sucks and is buggy (Ollama), there don't seem to be good enough VSC plugins to replace Copilot (e.g. with those smart edits, when you change one thing in a file but have similar changes needed like 10, 25 and 50 lines down) and many aren't even trying - KiloCode had some vendor locked garbage with no Ollama support, Cline and RooCode aren't even trying to support autocomplete.

And not every model out there (like Qwen3) supports FIM properly, so for a bit I had to use Qwen2.5 Coder, meh. Then when you have some plugins coming out, they're all pretty new and you also don't know what supply chain risks you're dealing with. It's the one use case where they could be good, but... they just aren't.

For all of the billions going into AI, someone should have paid a team of devs to create something that is both open (any provider) and doesn't fucking suck. Ollama is cool for the ease of use. Cline/RooCode/KiloCode are cool for chat and agentic development. OpenCode is a bit hit or miss in my experience (copied lines getting pasted individually), but I appreciate the thought. The rest is lacking.

evanreichard

2 months ago

Have you tried llama.vscode [0]? I use the vim equivalent, llama.vim [1] with Qwen3 Coder 30B and personally feel that it's better than Copilot. I have hot keys that allow me to quickly switch between the two and find myself always going back to local.

2 months ago

> So I can't see bothering with this when I pumped 260M tokens through running in Auto mode on a $20/mo Cursor plan. It was my first month of a paid subscription, if that means anything. Maybe someone can explain how this works for them?

They're running at a loss and covering up the losses using VC?

> Frankly, I don't understand it at all, and I'm waiting for the other shoe to drop.

I think that the providers are going to wait until there are a significant number of users that simply cannot function in any way without the subscription, and then jack up the prices.

elestor

2 months ago

yeah my 4GB of vram isn't gonna cut it

lucideng

2 months ago

A Mac dev type using a 5-year-old machine, I will believe it when I see it. I know a few older Macs still kicking around, but those people use them for basic stuff, not actual work. Mac people jump to new models faster than Taco Bell leaves my body.

artursapek

2 months ago

Imagine buying hardware that will be obsolete in 2 years instead of paying Anthropic $200 for $1000+ worth of tokens per month

selcuka

2 months ago

> Imagine buying hardware that will be obsolete in 2 years

Unless the PC you buy is more than $4,800 (24 x $200) it is still a good deal. For reference, a MacBook M4 Max with 128GB of unified RAM is $4,699. You need a computer for development anyway, so the extra you pay for inference is more like $2-3K.

Besides, it will still run the same model(s) at the same speed after that period, or even maybe faster with future optimisations in inference.

hu3

2 months ago

The value depreciation of the hardware alone is going to be significant. Probably enough to pay for 3x ~$20 subscriptions to OpenAI, Anthropic and Gemini.

Also, if you use the same mac to work, you can't reserve all 128GB for LLMs.

Not to mention a mac will never run SOTA models like Opus 4.5 or Gemini 3.0 which subscriptions gives you.

So unless you're ready to sacrifice quality and speed for privacy, it looks like a suboptimal arrangement to me.