cityofdelusion
2 hours ago
I am glad articles like this are finally starting to get some momentum around what I call the LLM magic box industry. From caveman mode to RTK to semantic search and everything in between. Developers have become magicians that cast spells instead of engineers. It sucks at work especially with everyone so sure that their magic spell is the one for ultimate token savings.
My criteria are: if it’s not in a harness it’s probably not that good (the best ideas float up to Codex/Claude imo) and any GitHub advertising some percent of token savings is not to be trusted.
It’s hard to avoid the snake oil and I hope people start thinking critically on this stuff.
AndyNemmity
18 minutes ago
This is why I Blind A/B test everything.
I burn a ton of tokens, but things actually have to prove their value. And the vast majority of things do not come close to doing so.
I have my own AI agent full of stuff. I blind A/B test everything, but I also don't think the results are all that useful as a signal to others.
Just because I Blind A/B test it 4 months ago, it's maybe not meaningful today.
Maybe the word choices I use dramatically impact things.
I do it, because I can prove the value, and see it with my own eyes. I don't even bother publishing the specific Blind A/B tests.
Also, I've seen other people try to Blind A/B test and get it very wrong. If your measurements aren't good, the test is meaningless.
I don't know. We're all working on these problems together. There's a lot of black magic (which is why I rely on hooks a lot). I'm sure I have tons of black magic, I have a large little AI Agent.
But what I know for certain, is it works for me. All it takes is for me to not use it, and I honestly don't know how everyone currently works with AI.
I will link it, but it is not an endorsement for what you do. Mostly only other software engineers use it. And it's so very specific to the things I have to do.
At best, maybe it sparks an idea for you to implement on your own.
philipbjorge
18 minutes ago
I'll go further and note that some of the optimizations I've seen in rtk for things like `git status` have actually bubbled up into the model layer -- Codex is regularly making tool calls like `git status --short` instead of `git status`.
arcanemachiner
2 hours ago
The idea itself is sound: If you can reduce the signal-to-noise ratio in the context window, then that's a good thing.
Whether or not RTK actually does this has not been established. I would be glad to see some proper benchmarks done on the actual difference this tool makes (not some meaningless "up to 90%" type of language).
lackoftactics
2 hours ago
I was wondering if that impacts the accuracy, obviously the rtk output wasn't in the training dataset, but maybe it doesn't matter at the end
lackoftactics
43 minutes ago
I have to say I made a similiar mistake with trusting semantic search is the next big thing. My opinion shifted, but it made sense for me for too long
striking
an hour ago
I mean it kind of already is in harnesses. Codex and Claude Code both have subagent tools. You could probably get a similar token output cut just by asking Claude Code to run all commands with Haiku as a summarizing subagent.
blubber
2 hours ago
There is a conflict of interest, though.