7777777phil
14 days ago
Earlier this month I argued why LLMs need episodic memory (https://philippdubach.com/posts/beyond-vector-search-why-llm...), and this lines up closely with what you're describing.
But I'm not sure it's a prompts-vs-rules problem. It's more about remembering past decisions as decisions. Things like 'we avoided this dependency because it caused trouble before' or 'this validation exists due to a past incident' have sequence and context. Flattening them into embeddings or rules loses that. I see even the best models making those errors over longer contexts right now.
My current view is that humans still need to control what gets remembered. Models are good at executing once context is right, but bad at deciding what deserves to persist.
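To make 'decisions as decisions' concrete, I mean keeping something like the record below rather than a bare rule or an embedding. This is only a rough sketch and every name in it is made up, not from the post:

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class DecisionRecord:
        decision: str                     # e.g. "pin dependency X below 3.0"
        rationale: str                    # why: "3.x broke retries in the ingest job"
        trigger: str                      # id of the incident/PR that prompted it
        decided_at: datetime              # when, so later decisions can supersede it
        supersedes: Optional[str] = None  # earlier decision this replaces, if any

    # Retrieval can then surface the chain of reasoning ("why do we do it
    # this way?") instead of a similarity hit with no history attached.
    log = [
        DecisionRecord(
            decision="pin dependency X below 3.0",
            rationale="3.x broke retry behaviour in the ingest job",
            trigger="incident-123",
            decided_at=datetime(2024, 5, 2),
        )
    ]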
qazxcvbnmlp
14 days ago
Humans decide what to remember based on their emotions. LLMs don't have emotions. Our definition of good and bad comes from our morals and emotions.
In the context of software development, requirements are based on what we want to do (which is based on emotion), and the methods we choose to implement them are based mostly on our predictions about what will and won't work well.
Most of our affinity for good development hygiene comes from the negative feelings we remember from the extra work that bad hygiene caused.
I think this explains a lot of the varied success with coding agents. You don't talk to them the way you talk to an engineer, because with an engineer you know they have a sense of what is good and bad. Coding agents won't tell you what is good and bad. They have some limited heuristics, but they don't understand nuance unless you prompt them on it.
Even if they had unlimited context window and memory, they would still need to decide which parts of that memory are important. I.e., if the human gave them conflicting instructions, how do they resolve that?
I think we'll eventually get to a state where a lot of the mechanics of coding and development can be incorporated into coding agents, but the what and why of what we build will still come from a human. I.e., an agent will be able to go from 0 to 100% on a full-stack web application by itself, including deployment with all the security, compliance, logins and whatever else, but it still won't know what's important to emphasize on that website. Should the images be bigger here, or the text? Questions like that.
dabaja
13 days ago
Interesting framing, but I think emotions are a proxy for something more tractable: loss functions over time. Engineers remember bad hygiene because they've felt the cost. You can approximate this for agents by logging friction: how many iterations did a task take, how many reverts, how much human correction. Then weight memory retrieval by past-friction-on-similar-tasks. It's crude, but it lets the agent "learn" that certain shortcuts are expensive without needing emotions. The hard part is defining similarity well enough that the signal transfers. Still early, but directionally this has reduced repeat mistakes in our pipeline more than static rules did.
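Roughly this shape, to be concrete. The weights are guesses and the similarity function is a stub, which is exactly the hard part I mentioned:

    from dataclasses import dataclass

    @dataclass
    class TaskRecord:
        task: str
        approach: str
        human_corrections: int   # times a person had to step in
        reverts: int
        iterations: int

    def friction(r: TaskRecord) -> float:
        # Crude weighted cost; weights here are illustrative, not tuned.
        return 3.0 * r.human_corrections + 2.0 * r.reverts + 0.5 * r.iterations

    def score(candidate: TaskRecord, new_task: str, similarity) -> float:
        # Boost memories of similar tasks, penalize the friction they carried.
        return similarity(candidate.task, new_task) / (1.0 + friction(candidate))

    history = [
        TaskRecord("add retry to ingest", "hand-rolled backoff", 4, 1, 12),
        TaskRecord("add retry to export", "use an existing retry library", 0, 0, 3),
    ]
    # With a constant similarity stub, the low-friction memory wins retrieval.
    best = max(history, key=lambda r: score(r, "add retry to billing", lambda a, b: 1.0))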
qazxcvbnmlp
13 days ago
How do you choose which loss function over time to pursue?
dabaja
10 days ago
Honestly, it's empirical. We started with what was easiest to measure: human correction rate. If I had to step in and fix something, that's a clear signal the agent took a bad path. Iterations and reverts turned out to be noisier -- sometimes high iteration count means the task was genuinely hard, not that the agent made a mistake. So we downweighted those. The meta-answer is: pick the metric that most directly captures "I wish the agent hadn't done that." For us that's human intervention. For a team with better test coverage, it might be test failures after commit. For infra work, maybe rollback frequency. There's no universal loss function — it depends on where your pain actually is. We just made it explicit and started logging it. The logging alone forced clarity.
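And the logging really can be that dumb. Something like this sketch (file name and fields are made up) is enough to start computing a correction rate per task type and see where the pain actually is:

    import csv, datetime, pathlib
    from collections import Counter

    LOG = pathlib.Path("agent_interventions.csv")

    def log_intervention(task_type: str, what_happened: str) -> None:
        # Append one row every time a human has to step in and fix something.
        new = not LOG.exists()
        with LOG.open("a", newline="") as f:
            w = csv.writer(f)
            if new:
                w.writerow(["timestamp", "task_type", "what_happened"])
            w.writerow([datetime.datetime.now().isoformat(), task_type, what_happened])

    def correction_rate(tasks_attempted: dict) -> dict:
        # tasks_attempted: how many tasks of each type the agent ran overall.
        corrections = Counter()
        with LOG.open() as f:
            for row in csv.DictReader(f):
                corrections[row["task_type"]] += 1
        return {t: corrections[t] / n for t, n in tasks_attempted.items() if n}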