hackernews client

Tell HN: I cut Claude API costs from $70/month to pennies

40 pointsposted 13 days ago

Item id: 46760285

30 Comments

LTL_FTC

13 days ago

It sounds like you don’t need immediate llm responses and can batch process your data nightly? Have you considered running a local llm? May not need to pay for api calls. Today’s local models are quite good. I started off with cpu and even that was fine for my pipelines.

kreetx

12 days ago

Though haven't done any extensive testing then I personally could easily get by with current local models. The only reason I don't is that the hosted ones all have free tiers.

queenkjuul

12 days ago

Agreed, I'm pretty amazed at what I'm able to do locally just with an AMD 6700XT and 32GB of RAM. It's slow, but if you've got all night...

ok_orco

12 days ago

I haven't thought about that, but really want to dig in more now. Any places you recommend starting?

LTL_FTC

10 days ago

I started off using gpt-oss-120b on cpu. It uses about 60-65gb of memory or so but my workstation has 128gb of ram. If I had less ram, I would start off with the gpt-oss-20b model and go from there. Look for MoE models as they are more efficient to run.

My old threadripper pro was seeing about 15tps, which was quite acceptable for the background tasks I was running.

ydu1a2fovb

12 days ago

Can you suggest any good llms for cpu?

LTL_FTC

10 days ago

R_D_Olivaw

12 days ago

Following.

LTL_FTC

10 days ago

Aerbil313

7 days ago

Hey Olivaw, saw a comment of yours asking about planners. Wanted to reply but it’s expired. Check out bullet journalling.

R_D_Olivaw

4 days ago

Thanks for the reply!

Bullet journaling is neat, but I'm far too whacky with my notes to stick to that kind of structure.

I have various other structures I implement, but they're just hodge podges of things.

44za12

13 days ago

This is the way. I actually mapped out the decision tree for this exact process and more here:

https://github.com/NehmeAILabs/llm-sanity-checks

homeonthemtn

12 days ago

That's interesting. Is there any kind of mapping to these respective models somewhere?

44za12

12 days ago

Yes, I included a 'Model Selection Cheat Sheet' in the README (scroll down a bit).

I map them by task type:

Tiny (<3B): Gemma 3 1B (could try 4B as well), Phi-4-mini (Good for classification). Small (8B-17B): Qwen 3 8B, Llama 4 Scout (Good for RAG/Extraction). Frontier: GPT-5, Llama 4 Maverick, GLM, Kimi

Is that what you meant?

hyuuu

8 days ago

at the sake of being obvious, do you have a tiny llm gating this decision and classifying and directing the task to its appropriate solution?

andai

6 days ago

>Before you reach for a frontier model, ask yourself: does this actually need a trillion-parameter model?

>Most tasks don't. This repo helps you figure out which ones.

About a year ago I was testing Gemini 2.5 Pro and Gemini 2.5 Flash for agentic coding. I found they could both do the same task, but Gemini Pro was way slower and more expensive.

This blew my mind because I'd previously been obsessed with "best/smartest model", and suddenly realized what I actually wanted was "fastest/dumbest/cheapest model that can handle my task!"

gandalfar

13 days ago

Consider using z.ai as model provider to further lower your costs.

andai

6 days ago

Do you mean with the coding plan?

13 days ago

Pretty straightforward. Sources dump into a queue throughout the day, regex filters the obvious junk ("lol", "thanks", bot messages never hit the LLM), then everything gets batched overnight through Anthropic's Batch API for classification. Feedback gets clustered against existing pain points or creates new ones.

Most of the cost savings came from not sending stuff to the LLM that didn't need to go there, plus the batch API is half the price of real-time calls.

dezgeg

13 days ago

Are you also adding the proper prompt cache control attributes? I think Anthropic API still doesn't do it automatically

ok_orco

11 days ago

No I need to look into this!