CuriouslyC
4 days ago
The most important thing is to have a strong plan cycle in front of your agent work; if you do that, agents are very reliable. You need a deep research cycle that collects a covering set of the code that might need to be modified for a feature, feeds it into Gemini/GPT-5 to get a broad codebase-level understanding, then runs a debate cycle on how to address it, with the final artifact being a hyper-detailed plan that goes file by file and outlines the changes required.
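As a rough sketch of the shape of that cycle (the model names, the ask() helper, and the file layout below are placeholders, not any particular tool's API):

    # Rough sketch of the plan cycle described above. ask() is a placeholder for
    # whatever model API you use; prompts and paths are illustrative only.
    from pathlib import Path

    def ask(model: str, prompt: str) -> str:
        """Placeholder: wire this to your model provider of choice."""
        raise NotImplementedError

    def plan_feature(issue: str, repo_root: str) -> str:
        # 1. Deep research: collect a covering set of files that might need changes.
        candidates = ask("researcher", f"List every file plausibly touched by:\n{issue}")
        code_dump = "\n\n".join(
            f"### {p}\n{Path(repo_root, p).read_text()}"
            for p in candidates.splitlines() if Path(repo_root, p).is_file()
        )

        # 2. Broad codebase-level understanding from a long-context model.
        survey = ask("gemini-or-gpt5", f"Summarize how these files interact:\n{code_dump}")

        # 3. Debate cycle: competing proposals, then a critique/merge pass.
        proposal_a = ask("planner-a", f"Propose an approach for {issue}\n{survey}")
        proposal_b = ask("planner-b", f"Propose a different approach for {issue}\n{survey}")
        verdict = ask("judge", f"Critique and merge:\nA:\n{proposal_a}\nB:\n{proposal_b}")

        # 4. Final artifact: hyper-detailed, file-by-file plan the coding agent consumes.
        plan = ask("planner-a", f"Write a file-by-file change outline:\n{verdict}\n{code_dump}")
        Path(repo_root, "plans", "feature-plan.md").write_text(plan)
        return plan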
Beyond this, you need to maintain good test coverage, and you need to have agents red-team your tests aggressively to make sure they're robust.
If you implement these two steps, your agent performance will skyrocket. The planning phase will produce plans that Claude can iterate on for 3+ hours in some cases if you tell it to complete the entire task in one shot, and the robust test validation / change-set analysis will catch agents solving an easier problem because they got frustrated or didn't follow directions.
skydhash
4 days ago
By that point I would have already produced the 20-line diff for the ticket. Huge commits (or change requests) are usually scaffolding, refactoring, or design changes to support new features. You've also got generated code and verbose languages like CSS. In other words, the kind of work where the more knowledge you have of the code, the faster you can be.
The daily struggle was always those 10-line diffs where you have to learn a lot (from the stakeholder, by debugging, from the docs).
CuriouslyC
4 days ago
A deep plan cycle will find stuff like this, because it's looking at the whole relevant portion of your codebase at once (and optionally the web, your internal docs, etc). It'll just generate a very short plan for the agent.
The important thing is that this process is entirely autonomous. You create an issue; that hooks the planners, the completion of a plan artifact hooks a test implementer, the completion of tests hooks the code implementer(s) (having cheaper models generate multiple candidate solutions and taking the best diff works well), and the completion of a solution + PR hooks code and security review, test red-teaming, etc.
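In rough pseudo-Python, the wiring is just a chain of stage handlers keyed on whichever artifact landed; run_agent and score_diff below are placeholders for your own agent runner and scoring, and the real triggers can be webhooks, watchers, or a queue:

    # Simplified sketch of the artifact-driven chain described above.
    from typing import Callable

    def run_agent(role: str, artifact: str) -> str:
        """Placeholder: dispatch an agent with the given role and input artifact."""
        raise NotImplementedError

    def score_diff(diff: str) -> float:
        """Placeholder: e.g. run the test suite and static checks against the diff."""
        raise NotImplementedError

    def best_of(artifact: str, n: int) -> str:
        # Cheaper models generate multiple candidate solutions; keep the best diff.
        candidates = [run_agent(f"implementer-{i}", artifact) for i in range(n)]
        return max(candidates, key=score_diff)

    # Each completed artifact type hooks the next stage in the pipeline.
    PIPELINE: dict[str, Callable[[str], str]] = {
        "issue": lambda a: run_agent("planner", a),                  # issue -> plan
        "plan":  lambda a: run_agent("test-implementer", a),         # plan -> tests
        "tests": lambda a: best_of(a, n=3),                          # tests -> diff(s)
        "diff":  lambda a: run_agent("code-and-security-review", a)  # PR -> review / red-team
    }

    def on_artifact(kind: str, artifact: str) -> None:
        """Called whenever an artifact lands (webhook, file watcher, queue consumer)."""
        handler = PIPELINE.get(kind)
        if handler is not None:
            handler(artifact)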
bit_bear
a day ago
What do those hooks look like at a low level? Is it a script polling against some ticket queue that triggers the planner? Is the handoff done using Watchman, which triggers agents on .md files dropped in certain directories?
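E.g. is it something as simple as this? (Just guessing at the shape; handle_artifact is a stand-in for however the next agent actually gets kicked off.)

    # Naive polling version of the handoff I'm imagining: watch a directory for new
    # .md artifacts and kick the next stage when one appears. Watchman/inotify
    # would replace the sleep loop; the directory name is hypothetical.
    import time
    from pathlib import Path

    WATCH_DIR = Path("artifacts/plans")
    seen: set[Path] = set()

    def handle_artifact(path: Path) -> None:
        print(f"would trigger next agent for {path}")  # placeholder

    while True:
        for md in WATCH_DIR.glob("*.md"):
            if md not in seen:
                seen.add(md)
                handle_artifact(md)
        time.sleep(5)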
rapind
3 days ago
> The planning phase will produce plans that Claude can iterate on for 3+ hours in some cases if you tell it to complete the entire task in one shot, and the robust test validation / change-set analysis will catch agents solving an easier problem because they got frustrated or didn't follow directions.
Don't you run into context nightmares though? I was coming up with very detailed plans (using zen to vet them with other models), but I found Claude just doing the wrong thing a lot of the time, ignoring and/or forgetting very specific instructions and rules, especially across context compactions.
There's one case that really sticks out in my mind because I had to constantly correct it: when to use ->> versus -> and how to handle null / type checks with PostgreSQL JSONB. Vibe coders would miss this sort of thing even with testing, unless they knew that a JSONB null is not the same as SQL NULL (and the same goes for other types). When working with nested data, you probably won't have test coverage for it. This is just one of many examples.
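For anyone who hasn't hit it, the gotcha fits in one query; the sketch below assumes psycopg2 and a local Postgres reachable via the placeholder DSN, and uses only literals so no tables are needed:

    # Sketch of the -> / ->> and JSONB-null-vs-SQL-NULL distinction.
    import psycopg2

    conn = psycopg2.connect("dbname=scratch")  # placeholder connection string
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT
              jsonb_typeof('{"a": null}'::jsonb -> 'a')  AS a_type,          -- 'null': a real JSON value
              ('{"a": null}'::jsonb -> 'a')  IS NULL     AS a_is_sql_null,   -- false: JSON null is not SQL NULL
              ('{"a": null}'::jsonb ->> 'a') IS NULL     AS a_text_is_null,  -- true: ->> maps JSON null to SQL NULL
              ('{}'::jsonb -> 'a')           IS NULL     AS missing_is_null  -- true: a missing key gives SQL NULL
        """)
        print(cur.fetchone())  # ('null', False, True, True)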
MndlshnDscpl
3 days ago
Agreed 100%. For those of us who have already spent ungodly hours creating hyper-detailed specifications for AI, the take that this is the solution to working with AI coding agents seems ridiculously naive. For context, I've also seen this behavior in Claude Code, and despite initially being extremely bullish on the technology, it's almost convinced me that it just isn't ready for prime time, no matter what the hucksters might tell you. Once you start seeing this, you quickly realize that it doesn't really matter how many guardrails you put in place or how detailed your specification is if your coding agent randomly decides to ignore your rules or specifications (even in 'brand new context' scenarios). I've lost track of how many times I've asked Claude why it did something when the Claude.md file expressly says to do the opposite (including words like 'important' or 'critical'), or when a specification document it read right before implementing, with a brand-new context, spells it out. Naturally, Claude's reply will be some variation of 'You're absolutely right to call me out on this. I should have done it the way it was spelled out in the specification.'
CuriouslyC
3 days ago
I have tripwires in my codebase to catch Claude running benchmarks with mock/synthetic data (because it had a hard time getting the real benchmark to run and decided to yeet it), to avoid potential scientific credibility issues, LOL. You can put the system on rails, but it's an engineering problem: these things are noisy program emitters with some P(correct|context), so you can model them as noisy channels and use the same error-correcting codes to create channels with arbitrarily low noise.
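Concretely, a tripwire can be as dumb as a guard in the benchmark harness; the marker-file scheme below is just one way it might look, not anything standard:

    # One possible tripwire: the harness refuses to benchmark unless the data
    # directory carries a provenance marker that agents are never told to create.
    # The marker name, hints, and path are illustrative.
    import sys
    from pathlib import Path

    REAL_DATA_MARKER = ".provenance"  # written once by a human, never by agents
    SYNTHETIC_HINTS = ("mock", "synthetic", "fake", "dummy")

    def assert_real_benchmark_data(data_dir: str) -> None:
        d = Path(data_dir)
        if not (d / REAL_DATA_MARKER).exists():
            sys.exit(f"tripwire: {data_dir} has no provenance marker; refusing to benchmark")
        if any(h in d.name.lower() for h in SYNTHETIC_HINTS):
            sys.exit(f"tripwire: {data_dir} looks like mock/synthetic data; refusing to benchmark")

    assert_real_benchmark_data("data/benchmarks/prod_traces")  # hypothetical path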
CuriouslyC
3 days ago
The key is to give each step very detailed instructions, and to tell Claude to dispatch the appropriate domain-expert subagent for each step with that step's specific instructions. That keeps the root context hot, and each subagent only gets the instructions it needs, with a fresh context.
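In pseudo-Python, the dispatch pattern is roughly this; spawn_subagent and the step list are made-up stand-ins for whatever your harness actually provides:

    # Sketch of the per-step dispatch pattern: the root agent only holds the step
    # list, and each domain-expert subagent gets a fresh context containing
    # nothing but its own instructions.
    def spawn_subagent(role: str, instructions: str) -> str:
        """Placeholder: start a fresh-context subagent and return its result."""
        raise NotImplementedError

    PLAN_STEPS = [
        ("db-migration-expert", "Add the new column and backfill script per plan section 2."),
        ("api-expert",          "Expose the field in the /v1/items endpoint per plan section 3."),
        ("test-expert",         "Extend integration tests to cover the new field per plan section 4."),
    ]

    def execute_plan() -> list[str]:
        results = []
        for role, instructions in PLAN_STEPS:
            # The root context never sees the subagent's working context,
            # only the step list and each step's result.
            results.append(spawn_subagent(role, instructions))
        return results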