simonw
3 days ago
I really feel this bit:
> With agentic coding, part of what makes the models work today is knowing the mistakes. If you steer it back to an earlier state, you want the tool to remember what went wrong. There is, for lack of a better word, value in failures. As humans we might also benefit from knowing the paths that did not lead us anywhere, but for machines this is critical information. You notice this when you are trying to compress the conversation history. Discarding the paths that led you astray means that the model will try the same mistakes again.
I've been trying to find the best ways to record and publish my coding agent sessions so I can link to them in commit messages, because increasingly the work I do IS those agent sessions.
Claude Code defaults to expiring those records after 30 days! Here's how to turn that off: https://simonwillison.net/2025/Oct/22/claude-code-logs/
I share most of my coding agent sessions through copying and pasting my terminal session like this: https://gistpreview.github.io/?9b48fd3f8b99a204ba2180af785c8... - via this tool: https://simonwillison.net/2025/Oct/23/claude-code-for-web-vi...
Recently been building new timeline sharing tools that render the session logs directly - here's my Codex CLI one (showing the transcript from when I built it): https://tools.simonwillison.net/codex-timeline?url=https%3A%...
And my similar tool for Claude Code: https://tools.simonwillison.net/claude-code-timeline?url=htt...
What I really want is first-class support for this from the coding agent tools themselves. Give me a "share a link to this session" button!
vunderba
3 days ago
When I've been hammering an LLM and it keeps veering down unproductive paths - trying poor solutions, applying fixes that make no difference - and we do eventually arrive at the correct answer, the result is often a massive 100+ KB running context.
To help mitigate this in the future I'll often prompt:
“Why did it take so long to arrive at the solution? What did you do wrong?”
Then I follow up with: “In a single paragraph, describe the category of problem and a recommended approach for diagnosing and solving it in the future.”
I then add this summary to either the relevant MD file (CHANGING_CSS_LAYOUTS.md, DATA_PERSISTENCE.md, etc.) or, more generally, to the DISCOVERIES.md file, which is linked from my CLAUDE.md as a bullet: "When resolving challenging directives, refresh yourself with: docs/DISCOVERIES.md - it contains useful lessons learned and discoveries made during development."
I don't think linking to an entire commit full of errors/failures is necessarily a good idea - feels like it would quickly lead to the proverbial poisoning of the well.
itsgrimetime
3 days ago
Yep - this has worked well for me too. I do it a little differently:
I have a /review-sessions command & a "parse-sessions" skill that tells Claude how to parse the session logs from ~/.claude/projects/, then it classifies the issues and proposes new skills, changes to CLAUDE.md, etc. based on what common issues it saw.
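The parsing step is roughly this kind of thing - a minimal sketch, assuming the sessions are JSONL files with one message object per line (the real format has more to it, so treat the field names here as guesses):

    import json
    from pathlib import Path

    # Walk ~/.claude/projects and pull the user/assistant turns out of each
    # JSONL session file. Field names ("message", "role", "content") are
    # guesses about the log format and may need adjusting.
    def iter_sessions(root=Path.home() / ".claude" / "projects"):
        for path in sorted(root.rglob("*.jsonl")):
            turns = []
            for line in path.read_text(encoding="utf-8").splitlines():
                try:
                    entry = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip malformed lines
                msg = entry.get("message") or {}
                content = msg.get("content")
                if isinstance(content, list):  # flatten content blocks to text
                    content = " ".join(
                        b.get("text", "") for b in content if isinstance(b, dict)
                    )
                if content:
                    turns.append((msg.get("role", entry.get("type", "?")), content))
            yield path, turns

    for path, turns in iter_sessions():
        print(path, len(turns), "turns")

The classify-and-propose part is then just another prompt run over those extracted turns.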
I've tried something similar to DISCOVERIES.md (a structured "knowledge base" of assumptions that were proven wrong, things that were tried, etc.) but haven't had luck keeping this from getting filled with obvious things (that the code itself describes) or slightly-incorrect things, or just too large in general.
vunderba
3 days ago
100%. I like the idea of turning it into a SKILL.
I do have to perform more manual adjustments/consolidation to the final postmortem before placing it in the discoveries md file, because as you pointed out LLMs tend to be exceptionally verbose.
johnsmith1840
3 days ago
When you get stuck in a loop it's best to roll all the code back to a point where it didn't have problems. If you keep debugging in that hammering failure loop you end up with TONS of random future bugs.
anamexis
3 days ago
I've had good luck doing something like this first (but more specific to the issue at hand):
We are getting stuck in an unproductive loop. I am going to discard all of this work and start over from scratch. Write a prompt for a new coding assistant to accomplish this task, noting what pitfalls to avoid.
YesBox
3 days ago
Over time, do you think this process could lock you into an inflexible state?
I'm reminded of the trade-off between automation and manual work. Automation crystallizes process, and thus the system as a whole loses its ability to adapt in a dynamic environment.
simonw
3 days ago
Nothing about this feels inflexible to me at the moment - I'm evolving the way I use these tools on a daily basis, constantly discovering new tricks that work.
Just this morning I found out that I can tell Claude Code how to use my shot-scraper CLI tool to debug JavaScript and it will start doing exactly that:
you can run javascript against the page using:
shot-scraper javascript /tmp/output.html \
'document.body.innerHTML.slice(0, 100)'
- try that
Transcript: https://gistpreview.github.io/?1d5f524616bef403cdde4bc92da5b... - background: https://simonwillison.net/2025/Dec/22/claude-chrome-cloudfla...
nosianu
3 days ago
We don't need automation for that - we "achieve" it through our processes already, specifically the software creation processes of large teams with many changing developers over long periods. Example (but they are not the only one): https://news.ycombinator.com/item?id=18442941 -- changing or adding anything becomes increasingly burdensome.
I would like to post that every time somebody warns of the dangers of AI for maintainability. We were already past that point long before AI. Businesses made the conscious decision that it is okay for quality to deteriorate: they'll squeeze profits from it for as long as possible, and they assume something new will have come along by then anyway. The few businesses still relying on that technical-debt-heavy product are still offered service, for large fees.
AI is just more of the same. When it becomes too hard to maintain they'll just create a new software product - pretty much how other things in the material world work too, e.g. housing, gadgets, or fashion. AI actually supports this even more: if new software can be created faster than old code can be maintained, that's quite alright for the money-making-oriented people. It is harder to sell maintenance than to sell something new at least once every decade anyway.
CuriouslyC
3 days ago
You can export all agent traces to OTel, either directly or via output logging, then just dump them into ClickHouse with metadata such as repo, git user, cwd, etc.
You can do evals and give agents long-term memory with the exact same infrastructure a lot of people already have for managing ops. No need to retool - just use what's available properly.
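A minimal sketch of the direct-export path - the OTLP endpoint and attribute names below are placeholders, not a prescribed schema:

    import os
    import subprocess

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    # Send spans over OTLP to whatever collector feeds ClickHouse.
    provider = TracerProvider()
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
    )
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("agent-sessions")

    def git(*args):
        return subprocess.run(["git", *args], capture_output=True, text=True).stdout.strip()

    # One span per agent session, tagged with the metadata you want to query on.
    with tracer.start_as_current_span("agent-session") as span:
        span.set_attribute("repo", git("remote", "get-url", "origin"))
        span.set_attribute("git.user", git("config", "user.email"))
        span.set_attribute("cwd", os.getcwd())
        # ... run the agent here, adding span events for each tool call ...

From there, evals and agent memory are queries over the same table.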
btown
3 days ago
With great love to your comment, this has the same vibes as the infamous 2007 Dropbox comment: https://news.ycombinator.com/item?id=9224
I'd also argue that the context for an agent message is not the commit/release for the codebase on which it was run, but often a commit/release that is yet to be set up. So there's a bit of apples-to-oranges in terms of release tagging for the log/trace.
It's a really interesting problem to solve, because you could in theory try to retroactively find which LLM session, potentially from days prior, matches a commit that just hit a central repository. You could automatically connect the LLM session to the PR that incorporated the resulting code.
Though, might this discourage developers from openly iterating with their LLM agent, if there's a panopticon around their whole back-and-forth with the agent?
Someone can, and should, create a plug-and-play system here with the right permission model that empowers everyone, including the Programmer-Archaeologists (to borrow shamelessly from Vernor Vinge) who are brought in to "un-vibe the vibe code" and benefit from understanding the context and evolution.
But I don't think that "just dump it in clickhouse" is a viable solution for most folks out there, even if they have the infrastructure and experience with OTel stacks.
CuriouslyC
3 days ago
I get where you're coming from, having wrestled with Codex/CC to get it to actually emit everything needed to even do proper evals.
From a "correct solution" standpoint having one source of truth for evals, agent memory, prompt history, etc is the right path. We already have the infra to do it well, we just need to smooth out the path. The thing that bugs me is people inventing half solutions that seem rooted in ignorance or the desire to "capture" users, and seeing those solutions get traction/mindshare.
NeutralForest
3 days ago
I think we already have the tools but not the communication between them? Instead of recording actions taken and failures as commit messages, you should have wide-event-style logs with all the context: failures, tools used, steps taken... Those logs could also serve as checkpoints to go back to, and you could refer back to the specific action ID you walked back to when encountering an error.
In turn, this could all be plain-text and be made accessible, through version control in a repo or in a central logging platform.
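Something like one event per agent action, appended to a JSONL file - the field names below are made up for illustration:

    import json
    import time
    import uuid

    # One "wide event" per agent action: the step, the tool used, the outcome,
    # and an action ID you can later roll back to and refer to.
    def log_action(path, *, step, tool, args, outcome, error=None, parent_id=None):
        event = {
            "action_id": str(uuid.uuid4()),
            "parent_id": parent_id,
            "timestamp": time.time(),
            "step": step,          # what the agent was trying to do
            "tool": tool,
            "args": args,
            "outcome": outcome,    # "ok" or "failed"
            "error": error,
        }
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
        return event["action_id"]

    aid = log_action("agent-log.jsonl", step="run tests", tool="pytest",
                     args=["-x"], outcome="failed", error="3 failures in test_auth")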
pigpop
3 days ago
I'm currently experimenting with trying to do this through documentation and project planning. Two core practices I use are a docs/roadmap/ directory with an ordered list of milestone documents and a docs/retros/ directory with dated retrospectives for each session. I'm considering adding architectural decision records (ADRs) as a dedicated space for documenting how things evolve. The quote from the article could be handled by the ADRs if they included notes on alternatives that were tried and why they didn't work as part of the justification for the decision that was made.
The trouble with this quickly becomes finding the right documents to include in the current working session. For milestones and retros it's simple: include the current milestone and the last X retros that are relevant, but even then you may sometimes want specific information from older retros. With ADRs you'd have to find the relevant ones somehow, and the same goes for any other documentation that gets added.
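The milestone-plus-retros half is simple enough to script - a rough sketch, assuming sortable filenames and a "status: done" marker in finished milestone docs (both conventions are assumptions here):

    from pathlib import Path

    # Gather context for a session: the current (first not-done) milestone
    # plus the last N dated retros.
    def session_context(docs=Path("docs"), n_retros=3):
        milestones = sorted((docs / "roadmap").glob("*.md"))
        current = next(
            (m for m in milestones if "status: done" not in m.read_text()), None
        )
        retros = sorted((docs / "retros").glob("*.md"))[-n_retros:]
        return ([current] if current else []) + retros

    for path in session_context():
        print(path)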
There is clearly a need for some standardization and learning which techniques work best as well as potential for building a system that makes it easy for both you and the LLM to find the correct information for the current task.
inerte
3 days ago
Yes! 100% this. I was talking to friends about this and there's gotta be some value in the sessions leading up to a commit. I doubt a human would read them all while reviewing a PR, but some RAG tool could, and could then provide more context to another agent or session. Sometimes in a session I like to talk about previous commits and PRs and sessions, and I just wish this was all handled automatically.
neutronicus
3 days ago
Emacs gptel just produces md or org files.
Of course the agentic capabilities are very much on a roll-your-own-in-elisp basis.
karthink
3 days ago
> agentic capabilities are very much on a roll-your-own-in-elisp basis
I use gptel-agent[1] when I want agentic capabilities. It includes tools and supports sub-agents, but I haven't added support for Claude skills folders yet. Rolling back the chat is trivial (just move up or modify the chat buffer), rolling back changes to files needs some work.
neutronicus
3 days ago
Oh, sick. Wasn't aware.
Don't think it's in Spacemacs yet but I'll have to try it out.
_alaya
3 days ago
Simon, I keep hoping that you will do one of your excellent reviews on Amp. It feels like the one 'major' agentic coding tool that is still flying under the radar. I intend to explore it myself, of course, but I'm curious about your take.
simonw
3 days ago
Amp, Cursor and OpenCode are the three that I'm most behind on I think. So many tools, so little time!
stacktraceyo
3 days ago
I'd like to make something like this, but running in the background, so I can better search my history of sessions - basically start building my own knowledge base of sorts.
simonw
3 days ago
Running "rg" in your ~/.claude/ directory is a good starting point, but it's pretty inconvenient without a nicer UI for viewing the results.
the_mitsuhiko
3 days ago
Amp represents threads in the UI, and an agent can search and reference its own history - that's also how the handoff feature works, for instance. It's an interesting system and I quite like it, but because it's not integrated into either GitHub or git, it's sufficiently awkward that I don't leverage it enough.
simonw
3 days ago
... this inspired me to try using a "rg --pre" script to help reformat my JSONL sessions for a better experience. This prototype seems to work reasonably well: https://gist.github.com/simonw/b34ab140438d8ffd9a8b0fd1f8b5a...
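The trick is that rg hands the --pre command the path of the file being searched and then searches whatever that command prints to stdout, so the script mostly just needs to turn each JSONL line back into readable text. A stripped-down sketch (the actual cc_pre.py is in the gist above; the field names here are approximations):

    #!/usr/bin/env python3
    import json
    import sys

    # rg --pre passes the file path as argv[1] and searches this script's
    # stdout. Field names for the session JSONL format are approximate.
    for line in open(sys.argv[1], encoding="utf-8"):
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        msg = entry.get("message") or {}
        content = msg.get("content")
        if isinstance(content, list):  # flatten content blocks to plain text
            content = " ".join(
                block.get("text", "") for block in content if isinstance(block, dict)
            )
        if content:
            print(f'{msg.get("role", entry.get("type", "?"))}: {content}')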
Use it like this:
cd ~/.claude/projects
rg --pre cc_pre.py 'search term here'
kgwxd
3 days ago
> There is, for lack of a better word, value in failures
Learning? Isn't that what these things are supposedly doing?
simonw
3 days ago
LLMs notoriously don't learn anything - they reset to a blank slate every time you start a new conversation.
If you want them to learn you have to actively set them up to do that. The simplest mechanism is to use a coding agent tool like Claude Code and frequently remind it to make notes for itself, or to look at its own commit history, or to search for examples in the codebase that is available to it.
the_mitsuhiko
3 days ago
If by "these things" you mean large language models: they are not learning. Famously so, that's part of the problem.
mock-possum
3 days ago
No, we’re the ones who are learning.
There’s some utility to instructing them to ‘remember’ via writing to CLAUDE.md or similar, and instructing them to ‘recall’ by reading what they wrote later.
But they'll rarely if ever do it on their own.
agumonkey
3 days ago
There's some research into context layering, so you can split/reuse previous chunks of context.
ps: your context log apps are very very fun
ashot
3 days ago
Check out codecast.sh
0_____0
3 days ago
"all my losses is lessons"