_zoltan_
12 hours ago
"In this post, I’ll cover a third, not-so-obvious approach: building ways for the agent to validate more of its own work before a human has to step in. "
this has been an obvious thing to do since at least January (when Geoffrey Huntley published "everything is a ralph loop"), and this is how I've been working: build enough orchestration tooling to automate everything: development container bringup, the build, running the unit tests, integration testing, and eventually using the software the way an end user would. then, to iterate, set performance goals on that already solid basis so the automated agent ("gym") can go and iterate autonomously and let you know when it's "done".
I understand this probably does not work if you're on some subscription and not using the API (tokens burn fast), but this has been extremely productive for me.
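A minimal sketch of the staged loop described above, in Python. Everything here is an assumption for illustration: the `Stage` type, the stage names, and the `fix` callback (standing in for whatever hands a failure back to the agent) are hypothetical and not the tooling from the comment.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Stage:
    """One step of the pipeline (hypothetical type for this sketch)."""
    name: str
    run: Callable[[], bool]  # returns True when the stage succeeds


def first_failure(stages: List[Stage]) -> Optional[str]:
    """Run stages in order and return the name of the first one that fails."""
    for stage in stages:
        if not stage.run():
            return stage.name
    return None


def iterate(stages: List[Stage], fix: Callable[[str], None],
            max_rounds: int = 5) -> Optional[int]:
    """Run the pipeline, hand the failing stage to `fix`, and retry.

    Returns the round in which every stage passed, or None if the
    round budget ran out first.
    """
    for round_no in range(1, max_rounds + 1):
        failed = first_failure(stages)
        if failed is None:
            return round_no
        fix(failed)
    return None
```

In practice each `run` would likely wrap a real command (container bringup, build, unit tests, integration tests, an end-to-end run) and report whether it exited cleanly; the round budget is one way to keep token spend bounded.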
dirtbag__dad
11 hours ago
You can get really far with the 20x Claude Code and Codex plans. They are many orders of magnitude cheaper than API calls.
_zoltan_
8 hours ago
Agreed! As long as they fit. :-)
kami23
12 hours ago
This is where most of my productivity gains have come from. I have a special harness that I now move from project to project to do my testing orchestration. A lot of my work day is setting up a prompt or two early and just letting them loop until they return evidence that the feature is working, having gone through the big QA loop.
I've slowly been optimizing for token use through the stack and Claude ends up making very tight for loops for most of the process and keeping token count even lower. It's been nice. A lot of my toil at work is just gone.
osigurdson
11 hours ago
I can see how you could avoid regressions this way, but what do you add to your harness to prove that a new feature is working?
_zoltan_
9 hours ago
for us it's (usually) very easy as I work on performance optimization. a non-negligible part of this is correctness and verifiability, so we already have some of that.
to give you an example: just recently I coded a feature that, for our shuffle operation, can report which channel the bytes flowed through (the PR giving us the plumbing underneath landed upstream recently). what this basically means is that you run the shuffle, you know you've shuffled X bytes (because you have stats on both ends), and then you need to attribute them to different layers. on the first iteration, the count was off. the agent went, debugged, fixed, iterated, and then it was 1.5% off. again, it went, iterated, ... and now we're fine.
part of the task description was that the breakdown must match the known number of bytes we're shuffling, so the agent took this on as a self-verification point. so besides running our normal, boring unit tests, integration tests and end-to-end verification harnesses (which it not only has programmatic/CLI/API access to, but which are also documented in .md files for projects), it could use this criterion on top to verify.
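The self-verification point above boils down to a conservation check: the per-channel byte counts have to add up to the known total. A rough Python sketch, assuming a hypothetical `check_attribution` helper and made-up channel names (the real stats API is not shown in this thread):

```python
from typing import Dict, Tuple


def check_attribution(total_bytes: int, per_channel: Dict[str, int],
                      tolerance: float = 0.0) -> Tuple[bool, float]:
    """Compare the sum of per-channel bytes against the known total.

    Returns (ok, relative_error). With the default tolerance of 0.0 the
    breakdown has to match the total exactly.
    """
    attributed = sum(per_channel.values())
    if total_bytes == 0:
        # Avoid dividing by zero: only an empty breakdown matches an empty shuffle.
        rel_error = 0.0 if attributed == 0 else float("inf")
    else:
        rel_error = abs(attributed - total_bytes) / total_bytes
    return rel_error <= tolerance, rel_error


# Hypothetical numbers: 15 of 1000 bytes are unaccounted for, i.e. 1.5% off.
ok, err = check_attribution(1000, {"channel_a": 600, "channel_b": 385})
print(ok, f"{err:.1%}")
```

A check like this gives the agent a pass/fail signal with a magnitude attached, so it can tell whether an iteration moved the breakdown closer to the known total.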
looking at /usage, my API duration was 2h 43m, and on top of that:
claude-haiku-4-5: 2.7k input, 115.3k output, 16.3m cache read, 867.9k cache write ($3.30)
claude-opus-4-8: 46.9k input, 555.0k output, 166.6m cache read, 2.9m cache write ($115.77)
osigurdson
8 hours ago
Definitely agree that performance optimization is a good use case for LLMs. Here you have both a measurable goal / objective function and guardrails against functional regressions. It kind of closes the loop in that regard.
One thing however is a test suite is not usually exhaustive in the sense that any code that passes the tests is valid. Usually tests are more complementary in nature. Therefore you could still get code degradation.
jaggederest
8 hours ago
> One thing however is a test suite is not usually exhaustive in the sense that any code that passes the tests is valid. Usually tests are more complementary in nature.
Not in the world of AI - if your tests don't catch any known issues, the problem is the tests aren't comprehensive enough. There's no excuse at this point not to have an incredibly comprehensive test suite, to go with your other agent feedback loop constraints
osigurdson
6 hours ago
>> if your tests don't catch any known issues, the problem is the tests aren't comprehensive enough.
Maybe I misunderstand, but this seems like a fairly low bar if the test suite only covers existing bugs.
I'd argue that if you aren't going to look at the code, you actually need a fully comprehensive test suite - in the sense that if the tests pass, the code is correct and you don't have to look at it at all. The problem is that such a suite doesn't seem to be quick to create. Of course, if there is a way to do it quickly and in a way that is reproducible by others, I'd love to hear about it.
jaggederest
6 hours ago
I don't mean just bugs, I mean any known issues. I test infra, I test UI, I test binary protocols, you name it. There is certainly no fast way to do it, even with AI (an AI generated suite is better than nothing but not as good), and it's a serious investment, but it's worth it. Testing becomes a process of correctness checking that snowballs over time, making everything else easier and better (or else the tests need further adjustment!)
osigurdson
5 hours ago
Right. You mean all behaviors are tested, essentially.
So if you / team are going to implement a new feature, what does that look like? Do you write Gherkin or similar, unit tests or both? Can you provide an example of what that might look like? How much of this has changed for you since the pre-AI days?
jaggederest
an hour ago
These days, yes: an integration test at the high level (usually a 1-to-3 liner), then unit tests as I go, and often some mocked functional tests. This is basically the same as before but a ton faster in the AI days. You have to hold the AI accountable, demand quality, and iterate, but this weekend I've built an entire test suite for a monorepo I just started working on. It's garbage quality but better than no tests, of course, and will improve as I work.
You can find some open source examples on github, either directly https://github.com/pgdogdev/pgdog/commits/main/?author=jagge... or through my profile - that repo has a pure-sql integration suite I wrote essentially entirely with AI: https://github.com/pgdogdev/pgdog/tree/main/integration/sql
There's also older work on github you can see over the years, a mishmash and grab bag, I would prefer if more of my work were open source but somehow most employers still default to closed source
Edit: While I'm thinking about it, the other thing you can do with AI is demand that it TDD things - I'm more of a "test all the fucking time" adherent, I don't care whether the tests are written first, but AI is perfectly happy to skate by making a tautological test unless you make it write the test first, ensure it fails correctly, make your change, and don't let it modify the test.
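A rough sketch of that red/green guard in Python. This is an illustration under stated assumptions, not an existing tool: `red_green`, `run_test`, and `apply_change` are hypothetical names, and `run_test` stands in for however the test actually gets executed.

```python
import hashlib
from pathlib import Path
from typing import Callable


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, used to detect edits to the test."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def red_green(test_file: Path,
              run_test: Callable[[], bool],
              apply_change: Callable[[], None]) -> str:
    """Enforce: test fails first, change is made, test untouched, test passes."""
    if run_test():
        # A test that passes before the change proves nothing about it.
        return "rejected: test passed before the change (possibly tautological)"
    before = file_digest(test_file)
    apply_change()
    if file_digest(test_file) != before:
        return "rejected: test file was modified"
    if not run_test():
        return "rejected: test still fails after the change"
    return "accepted"
```

The digest comparison covers the "don't let it modify the test" part; checking that the test fails for the expected reason, rather than just failing, would still need a human or a stricter assertion on the failure output.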
kami23
10 hours ago
I have it record a series of gifs or videos that I look over. If something looks off I'll dig into it, but I break down work into very very small chunks that are usually easily verifiable or don't require multiple steps.
Another thing I have in the general sdlc process is having it add enough logging to verify features are turned on, configured as we expected, and that becomes enough feedback for most of my features.
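As an illustration of using logs as the feedback signal, here is a small Python sketch that compares logged feature state against what was expected. The log format (`feature=<name> enabled=<true|false>`) and the flag names are assumptions made up for this example, not the format used in the setup described above.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical log format: "... feature=<name> enabled=<true|false>"
FLAG_LINE = re.compile(r"feature=(?P<name>[\w.-]+)\s+enabled=(?P<state>true|false)")


def observed_flags(log_lines: Iterable[str]) -> Dict[str, bool]:
    """Collect the last logged state of each feature flag."""
    flags: Dict[str, bool] = {}
    for line in log_lines:
        match = FLAG_LINE.search(line)
        if match:
            flags[match.group("name")] = match.group("state") == "true"
    return flags


def flag_mismatches(expected: Dict[str, bool],
                    log_lines: Iterable[str]) -> List[str]:
    """Return a human-readable problem for each flag that is missing or wrong."""
    seen = observed_flags(log_lines)
    problems = []
    for name, want in expected.items():
        if name not in seen:
            problems.append(f"{name}: never logged")
        elif seen[name] != want:
            problems.append(f"{name}: expected {want}, logged {seen[name]}")
    return problems
```

An empty list from `flag_mismatches` is the "evidence" an agent can return; anything else tells it which flag to look at next.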
I've been mostly focusing on being able to replicate this across stacks, more than 3 projects so far (with the eventual goal of having an agent be able to orchestrate our complete infra stack, and this being a large component of a DR plan to rebuild).
None of this is really new for us. I'm just the most knowledgeable in my group about how the different products across teams glue together, so I've been creating these Rube Goldberg machines as a prototype, and then having it iterate on codifying the parts that don't need a constant LLM. We were blessed to have an engineer a decade ago build out tooling for local container automation that matches 95% of the deployed infra stack. That last 5% sucks when you fall into it, but that's always been a truth. I've added to and expanded the tool over the years, making it act more like the deployed environment networking-wise, but a lot of things don't end up working well in Docker containers on M-series Macs when most of our complicated virtualization in our private cloud can't run on them yet...
pstorm
10 hours ago
I’ve been building this out too, and your comment made me realize the missing piece for me. I’ve given the agents tools to validate their own work, but I haven’t improved the experience of humans verifying the agents’ work.
kami23
7 hours ago
For video/image stuff I found the ability for the LLMs to use ffmpeg and imagemagick to be quite fun.
lovich
2 hours ago
What’s the cost of all that though? I don’t doubt that productivity could be gained but when I see articles like the one on the Open Claw guy spending 1.3 million on tokens in a single month I am reminded of drag racing engines that can reach incredible speeds but also need to be completely rebuilt after a single race.
psychoslave
12 hours ago
What license do you use then?
_zoltan_
12 hours ago
you can just pay by volume ("API pricing")