lmeyerov
21 hours ago
Something I would add is planning. A big "aha" for effective use of these tools is realizing they run on dynamic TODO lists. Ex: plan mode is basically bootstrapping how that TODO list gets seeded and how todos ground themselves when they're reached, and user interactions are how you realign the todo lists. The todo list is subtle but was a big shift in coding tools, and many seem surprised when we discuss it -- most focus on whether to use plan mode or not, but todo lists will still be active either way. I ran a fun experiment last month on how well Claude Code solves CTFs, and disabling the TodoList tool and planning costs 1-2 grade jumps: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t... .
Fwiw, I found it funny how the article stuffs "smarter context management" into a breezy TODO bullet point at the end for going production-grade. I've been noticing a lot of NIH/DIY types believing they can do a good job of this and then, when forced to produce results/evals that don't suck in production, losing the rest of the year on that step. (And it's even worse when they decide to fine-tune too.)
btown
20 hours ago
I'm unsure of its accuracy/provenance/outdatedness, but this purportedly extracted system prompt for Claude Code provides a lot more detail about TODO iteration and how powerful it can be:
https://gist.github.com/wong2/e0f34aac66caf890a332f7b6f9e2ba...
I find it fascinating that while in theory one could just append these as reasoning tokens to the context, and trust the attention algorithm to find the most recent TODO list and attend actively to it... in practice, creating explicit tools that essentially do a single-key storage are far more effective and predictable. It makes me wonder how much other low-hanging fruit there is with tool creation for storing language that requires emphasis and structure.
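As a concrete guess at what that looks like (tool name and schema are my speculation, not the actual implementation), a minimal sketch in Python:

```python
# A minimal sketch of the single-key-storage idea: every call overwrites
# the whole todo list, so the latest state is always one tool result away
# rather than buried somewhere in older reasoning tokens.
TODOS: list[dict] = []  # the single "key"

def todo_write(items: list[dict]) -> str:
    """Tool: replace the entire todo list. Each item: {"content", "status"}."""
    global TODOS
    TODOS = items
    done = sum(1 for t in items if t["status"] == "completed")
    return f"Todo list updated: {done}/{len(items)} completed."

# The agent loop can then re-surface TODOS verbatim each turn, instead of
# trusting attention to find a stale list mid-context.
```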
lmeyerov
20 hours ago
I find in coding + investigating there's a lot of mileage in being fancier with the todo list. E.g., we make sure timestamps, branches, outcomes, etc. are represented. It's impressive how far they get with so little!
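As a rough sketch (field names are illustrative, not our exact schema):

```python
# Illustrative only: a "fancier" todo entry carrying timestamps, branch,
# and outcome alongside the usual status field.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TodoItem:
    content: str
    status: str = "pending"             # pending | in_progress | completed
    branch: str | None = None           # git branch the work landed on
    outcome: str | None = None          # short result note, incl. failures
    started_at: datetime | None = None
    finished_at: datetime | None = None
```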
For coding, I actually fully take over the todo list in codex + claude: https://github.com/graphistry/pygraphistry/blob/master/ai/pr...
In Louie.ai, for investigations, we're experimenting with enabling more control of it, so you can go with the grain, vs. that kind of whole-cloth replacement.
btown
20 hours ago
Ooh, am I reading correctly that you're using the filesystem as the storage for a "living system prompt" that also includes a living TODO list? That's pretty cool!
And on a separate note - it looks like you're making a system for dealing with graph data at scale? Are you using LLMs primarily to generate code for new visualizations, or also to reason directly about each graph in question? To tie it all together, I've long been curious whether tools can adequately translate things from "graph space" to "language space" in the context of agentic loops. There seems to be tremendous opportunity in representing e.g. physical spaces as graphs, and if LLMs can "imagine" what would happen if they interacted with them in structured ways, that might go a long way towards autonomous systems that can handle truly novel environments.
lmeyerov
19 hours ago
yep! So all repos get a (.gitignore'd) folder of `plans/<task>/plan.md` work histories. That ends up being quite helpful in practice: calculating billable hours of work, forking/auditing/retrying, easier replanning, etc. At the same time, I'd rather be with-the-grain of the agentic coder's native systems for plans + todos, e.g., staying aligned with the models & prompts. We've been doing it this way b/c we find the native systems weaker than what these achieve, and too hard to add these kinds of things to.
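E.g., the timestamped entries make the billable-hours part almost free; a toy sketch (the heading format is illustrative, not our actual one):

```python
# Toy sketch: with timestamped entry headings in plans/<task>/plan.md,
# rough billable hours fall out of the work history.
import re
from datetime import datetime
from pathlib import Path

STAMP = re.compile(r"^## (\d{4}-\d{2}-\d{2}T\d{2}:\d{2})")  # assumed heading format

def task_hours(plan_md: Path) -> float:
    stamps = [datetime.fromisoformat(m.group(1))
              for line in plan_md.read_text().splitlines()
              if (m := STAMP.match(line))]
    if len(stamps) < 2:
        return 0.0
    return (max(stamps) - min(stamps)).total_seconds() / 3600

total = sum(task_hours(p) for p in Path("plans").glob("*/plan.md"))
print(f"~{total:.1f} hours across tasks")
```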
RE: other note, yes, we have 2 basic goals:
1. Louie makes graphs / Graphistry easier. Especially when connected to operational databases (Splunk, Kusto, Elastic, BigQuery, ...). V1 was generating Graphistry viz & GFQL queries. We're now working on Louie inside of Graphistry, for more dynamic control of the visual analysis environment ("filter to X and color Y as Z") and, as you say, to go straight to the answer too ("what's going on with account/topic X"). We spent years trying to bring Jupyter notebooks etc. to operational teams as a way to get graph insights into their various data, and while good for a few "data 1%'ers", it was too hard for most; Louie has been a chance to rethink that.
2. Louie has been seeing wider market interest beyond graph, basically "AI that investigates" across those operational DBs (& live systems). You can think of it as: vibe coding is code-oriented, while Louie is vibe investigating, which is more data-oriented. Ex: native plans don't think in unit tests but in cross-validation, and instead of grepping 1,000 files, we get back a dataframe of 1M query results and pass that between the agents for localized agentic retrieval, vs. rehammering the DB. The CCC talk gives a feel for this in the interactive setting.
jodleif
6 hours ago
For humans, org-mode is good at this.
homarp
19 hours ago
isn't the system prompt of Claude public in the docs at https://platform.claude.com/docs/en/release-notes/system-pro... ?
Stagnant
16 hours ago
The system prompt of Claude Code changes constantly. I use this site to see what has changed between versions: https://cchistory.mariozechner.at/
It's a bit odd that Anthropic doesn't make this available more openly. Depending on your preferences, there's stuff in the default system prompt that you may want to change.
I personally have a list of phrases that I patch out of the system prompt after each update by running sed on cc's main.js.
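Roughly the moral equivalent, as a Python sketch (the path and phrases are placeholders for illustration, not my actual list):

```python
# Placeholder sketch of patching phrases out of cc's bundled main.js after
# each update. Path and phrases below are made up, not my real list.
from pathlib import Path

MAIN_JS = Path("/path/to/claude-code/main.js")  # wherever your install puts it
PHRASES = [
    "IMPORTANT: Always use the TodoWrite tool.",  # made-up example phrase
]

src = MAIN_JS.read_text()
for phrase in PHRASES:
    src = src.replace(phrase, "")
MAIN_JS.write_text(src)
```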
handfuloflight
15 hours ago
What are those phrases? Why do you exclude them?
btown
18 hours ago
This is for Claude Code, not just Claude.
what
17 hours ago
From elsewhere in that prompt:
> Only use emojis if the user explicitly requests it. Avoid adding emojis to files unless asked.
When did they add this? Real shame because the abundance of emojis in a readme was a clear signal of slop.
rrvsh
19 hours ago
I've had a LOT of success keeping a "working memory" file for CLI agents. I'm currently testing out Codex, and what I'll do is spend ~10 mins hashing out the spec and splitting it into a list of changes, then tell the agent to save those changes to a file and keep that file updated as it works through them. The crucial part here is to tell it to review the plan and modify it if needed after every change. This keeps the LLM doing what it does best (short-term goals with limited context) while removing the need to constantly prompt it. Essentially it feels like an alternative to subagents, for the same or a similar result.
tacone
6 hours ago
I use a folder for each feature I add. The LLM is only allowed to output markdown files in the output subfolder (of course it doesn't always obey, but it still limits pollution in the main folder).
The folder contains a plan file and a changelog. The LLM is asked to continuously update the changelog.
When I open a new chat, I attach the folder and say: onboard yourself on this feature, then get back to me.
This way, it has context on what has been done, the attempts it made (and perhaps failed at), the current status, and the chronological order of the changes (with the recent ones usually considered more authoritative).
fastball
18 hours ago
Planning mode actually creates whole markdown files, then wipes the context that was required to create that plan before starting work. Then it holds the plan at the system prompt level to ensure it remains top of mind (and survives unaltered during context compaction).
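If so, the shape is roughly this (a sketch of the pattern, not Claude Code's actual code):

```python
# Sketch of the claimed pattern: the plan is pinned alongside the system
# prompt, so compacting the chat history never touches it.
def build_context(system_prompt: str, plan_md: str, history: list[dict],
                  keep_last: int = 20) -> list[dict]:
    compacted = history[-keep_last:]  # stand-in for real summarization
    return [{"role": "system",
             "content": f"{system_prompt}\n\n# Current plan\n{plan_md}"},
            *compacted]
```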
ramoz
7 hours ago
I don’t think it wipes the context window.
veselin
12 hours ago
I run evals, and the Todo tool doesn't help most of the time. Usually models on high thinking will maintain todo/state in their thinking tokens. Where the Todo tool does help is in cases like getting Anthropic models to run more parallel tool calls: if there is a Todo list call, some of the actions after it are more efficient.
What you need to do is match the distribution of how the models were RL-ed. So you are right to say that "do X in 200 lines" is a very small part of the job to be done.
lmeyerov
3 hours ago
Curious what kinds of evals you focus on?
We're finding investigating to be same-but-different to coding. Probably the closest to ours with a bigger evals community is AI SRE tasks.
Agreed wrt all these things being contextual. The LLM needs to decide whether to trigger tools like self-planning and todo lists and, as the talk gives examples of, which kinds of strategies to use with them.
shnpln
20 hours ago
Oh yes, I commonly add something like "Use a very granular todo list for this task" at the end of my prompts. And sometimes I will say something like "as your last todo, go over everything you just did again and use a linter or other tools to verify your work is high quality"
hadlock
19 hours ago
Right now I start by chatting with a separate LLM about the issue, the best structure for maintainability, the best libraries for the job, and edge and corner cases and how to handle those, and then have it spit out a prompt and a checklist. If it has a UI, I'll draw something in Paint and refine it before having the LLM describe it in detail (primary workflow etc.) and tell it to format it for an agent to use. That will usually get me a functional system on the first try, which can then be iterated on.
That's for complicated stuff. For throw-away stuff I don't need to maintain past 30 days like a script I'll just roll the dice and let it rip.
shnpln
18 hours ago
Yeah, this is a good idea. I will have a Claude chat session and a Claude Code session open side by side too.
It's like a manual subagents approach. I try not to pollute the Claude Code session context with meanderings too much: do that in the chat and bring the condensed ideas over.
renjimen
19 hours ago
If you have pre-commit hooks, it should do this last bit automatically and use your project settings.
shnpln
19 hours ago
Yes, I do. But it does not always use them when I change contexts. I just get in the habit of saying it. Belt and suspenders approach.
jcims
7 hours ago
Mind if I ask what models you're using for CTFs? I got out of the game about ten years ago and have recently been thinking about dipping my toes back in.
lmeyerov
3 hours ago
Yep -- one fun experiment early in the video shows that going sonnet 4.5 -> opus 4.5 gave a 20% lift.
We do a bit of model-per-task: most calls send targeted & limited context fetches into faster higher-tier models (frontier, but no heavy reasoning tokens), with occasional larger data dumps (logs/dataframes) sent into faster-and-cheaper models. Commercially, we're steering folks right now more toward OpenAI / Azure OpenAI models, but that's not at all inherent; OpenAI, Claude, and Gemini can all be made to perform well here using what the talk goes over.
Some of the discussion earlyish in the talk and in the Q&A after is on making OSS models production-grade for these kinds of investigation tasks. I find them fun to learn on and encourage homelab experiments, and for copilots you can get mileage. For heavier production efforts, I typically don't recommend them for most teams at this time, for quality, speed, practicality, and budget reasons, if they have the option to go with frontier models. However, some bigger shops are doing it, and I'd be happy to chat about how we're approaching quality/speed/cost there (and we're looking for partners on making this easier for everyone!).
jcims
2 hours ago
Nice! Thank you!
I just did an experiment yesterday with Opus 4.5 just operating in agent mode in vscode copilot. Handed it a live STS session for AWS to see if it could help us troubleshoot an issue. It was pretty remarkable seeing it chop down the problem space and arrive at an accurate answer in just a few mins.
I'll definitely check out the video later. Thanks!
matchagaucho
20 hours ago
The TODO lists are also frequently re-inserted into the context HEAD to keep the LLM aware of past and next steps.
And in the event of context compression, the TODO serves as a compact representation of the session.
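A sketch of that compression idea (my framing, not any specific product's code):

```python
# Sketch: on compaction, the dropped turns are replaced by the todo list,
# which doubles as the session summary; it also sits at the context head
# each turn so past and next steps stay visible.
def compact_with_todos(todos: list[str], messages: list[dict],
                       keep_last: int = 10) -> list[dict]:
    head = {"role": "user",
            "content": "Session so far, as TODO state:\n"
                       + "\n".join(f"- {t}" for t in todos)}
    return [head, *messages[-keep_last:]]
```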
bdangubic
20 hours ago
At the end of the year, then you get "How to Code Claude Code in 200 Million Lines of Code" :)