HP Z2 Mini G9 with a 20GB RTX 4000 Ada, 96GB RAM, 2TB SSD, Ubuntu. If I were buying today I'd get either a MacBook with a ton of RAM or a full form factor PC; the mini form factor looks nice but gets real hot and is hard to upgrade.
Tools: LM Studio for playing around with models; the ones I settle on for work go into Ollama.
Models:
Qwen3 Coder 30B is the one I come back to most for coding tasks. It's decent on isolated tasks, but not so much at the multi-step, context-heavy agentic work that the hosted frontier models are pushing forward, which is understandable.
I've found the smaller models (the 7B Qwen coder models, gpt-oss-20B, Gemma 7B) extremely useful because they respond so fast (~80 t/s for gpt-oss-20B on the above hardware): it's quicker to get an answer than Googling or asking ChatGPT, and quick to see when they're failing to answer so I can move on to something else.
Use cases:
Mostly small one-off questions (like 'what is the syntax for X SQL feature on Postgres' or 'write a short Python script that does Y') where the response comes back quicker than from Google, ChatGPT, or even trying to remember it myself.
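Under the hood that's just a tiny wrapper script on my PATH; roughly like this sketch (the helper name and model tag are made up, and 11434 is just Ollama's default port):

    # hypothetical ~/bin/ask helper: one-off questions from the terminal,
    # answered by whatever small model is currently loaded in Ollama
    # usage: ask "what is the syntax for a lateral join in Postgres"
    import sys

    import requests

    question = " ".join(sys.argv[1:])
    resp = requests.post(
        "http://localhost:11434/api/chat",       # Ollama's default local endpoint
        json={
            "model": "gpt-oss:20b",              # example tag; use whatever is pulled
            "messages": [{"role": "user", "content": question}],
            "stream": False,                     # single JSON reply instead of a stream
        },
        timeout=60,
    )
    print(resp.json()["message"]["content"])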
Doing some coding with Aider and a VS Code plugin (kinda clunky integration), but I quickly end up escalating anything hard to hosted frontier models (Anthropic, OpenAI via their CLIs, or Cursor). I often hit usage limits on the hosted models, so it's nice to have somewhere my dumbest questions can go without burning tokens I want to reserve for real work.
Small LLM scripting tasks with DSPy (simple categorization, CSV-munging type tasks); sometimes larger RAG/agent type things with LangChain, but it's a lot of overhead for personal scripts.
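A minimal sketch of what one of those DSPy categorization scripts looks like (the model string assumes DSPy's LiteLLM-backed Ollama support; the signature and labels are invented for illustration):

    import dspy

    # point DSPy at the local Ollama server (model tag is just an example)
    lm = dspy.LM("ollama_chat/qwen2.5:7b", api_base="http://localhost:11434", api_key="")
    dspy.configure(lm=lm)

    class Categorize(dspy.Signature):
        """Assign a short text snippet to exactly one category."""
        text: str = dspy.InputField()
        category: str = dspy.OutputField(desc="one of: bug, feature, question")

    classify = dspy.Predict(Categorize)
    print(classify(text="The export button crashes the app").category)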
My company is building a software product that heavily uses LLMs, so I often point my local dev environment at my local model (whatever's loaded, usually one of the 7B models). Initially I did this to avoid incurring API costs, but as prices have come down it's now more about latency: I can test interface changes etc. faster, especially since the new thinking models can take a long time to respond.
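Concretely that's just pointing the OpenAI-compatible client at Ollama's /v1 endpoint instead of the hosted API; a rough sketch (the model tag is an example, and the api_key is a dummy value the client requires but Ollama ignores):

    from openai import OpenAI

    # same client the app uses normally, just aimed at the local server
    client = OpenAI(
        base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
        api_key="ollama",                      # placeholder; not checked locally
    )

    resp = client.chat.completions.create(
        model="qwen2.5-coder:7b",  # whatever 7B model happens to be loaded
        messages=[{"role": "user", "content": "ping"}],
    )
    print(resp.choices[0].message.content)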
It's also helpful to try to build LLM functions that work with small models, since it means they run efficiently and portably on larger ones. One technical-debt trap I've noticed when building for LLMs is that as large models get better you can get away with stuffing them with crap and still get good results... up until you don't.
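To make that concrete, the kind of narrowly-scoped LLM function I try to keep small-model-friendly looks roughly like this (the names are hypothetical, and client is any OpenAI-compatible client like the one above):

    # sketch: minimal context in, constrained output out, so the same call can be
    # tested against a local 7B model or a hosted frontier model interchangeably
    def categorize_expense(client, model: str, description: str) -> str:
        allowed = {"travel", "software", "hardware", "other"}
        resp = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{
                "role": "user",
                "content": (
                    "Categorize this expense as exactly one of: travel, software, hardware, other.\n"
                    f"Expense: {description}\n"
                    "Answer with the single word only."
                ),
            }],
        )
        answer = resp.choices[0].message.content.strip().lower()
        return answer if answer in allowed else "other"  # guard against chatty replies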
It's striking how fast things are moving in the local LLM world. Right now the Qwen/gpt-oss models "feel" like gpt-3.5-turbo did a couple of years back, which is remarkable given how groundbreaking (and expensive to train) 3.5 was, and now you can get similar results on sub-$2k consumer hardware.
However, it's very much still in the "tinkerer" phase, where it's overall a net productivity loss (and a massive financial loss) vs just paying $20/mo for a hosted frontier model.