hackernews client

How to code Claude Code in 200 lines of code

816 pointsposted a month ago

248 Comments

lmeyerov

a month ago

Something I would add is planning. A big "aha" for effective use of these tools is realizing they run on dynamic TODO lists. Ex: Plan mode is basically bootstrapping how that TODO list gets seeded and how todos ground themselves when they get reached, and user interactions are how you realign the todo lists. The todolist is subtle but was a big shift in coding tools, and many seem to be surprised when we discuss it -- most seem to focus on whether to use plan mode or not, but todo lists will still be active. I ran a fun experiment last month on how well claude code solves CTFs, and disabling the TodoList tool and planning is 1-2 grade jumps: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t... .

Fwiw, I found it funny how the article stuffs "smarter context management" into a breeze-y TODO bullet point at the end for going production-grade. I've been noticing a lot of NIH/DIY types believing they can do a good job of this and then, when forced to have results/evals that don't suck in production, losing the rest of the year on that step. (And even worse when they decide to fine-tune too.)

btown

a month ago

I'm unsure of its accuracy/provenance/outdatedness, but this purportedly extracted system prompt for Claude Code provides a lot more detail about TODO iteration and how powerful it can be:

https://gist.github.com/wong2/e0f34aac66caf890a332f7b6f9e2ba...

I find it fascinating that while in theory one could just append these as reasoning tokens to the context, and trust the attention algorithm to find the most recent TODO list and attend actively to it... in practice, creating explicit tools that essentially do a single-key storage are far more effective and predictable. It makes me wonder how much other low-hanging fruit there is with tool creation for storing language that requires emphasis and structure.

lmeyerov

a month ago

I find in coding + investigating there's a lot of mileage to being fancier on the todo list. Eg, we make sure timestamps, branches, outcomes, etc are represented. It's impressive how far they get with so little!

For coding, I actually fully take over the todo list in codex + claude: https://github.com/graphistry/pygraphistry/blob/master/ai/pr...

In Louie.ai, for investigations, we're experimenting with enabling more control of it, so you can go with the grain, vs that kind of wholecloth replacement

btown

a month ago

Ooh, am I reading correctly that you're using the filesystem as the storage for a "living system prompt" that also includes a living TODO list? That's pretty cool!

And on a separate note - it looks like you're making a system for dealing with graph data at scale? Are you using LLMs primarily to generate code for new visualizations, or also to reason directly about each graph in question? To tie it all together, I've long been curious whether tools can adequately translate things from "graph space" to "language space" in the context of agentic loops. There seems to be tremendous opportunity in representing e.g. physical spaces as graphs, and if LLMs can "imagine" what would happen if they interacted with them in structured ways, that might go a long way towards autonomous systems that can handle truly novel environments.

lmeyerov

a month ago

yep! So all repos get a (.gitignore'd) folder of `plans/<task>/plan.md` work histories . That ends up being quite helpful in practice: calculating billable hours of work, forking/auditing/retrying, easier replanning, etc. At the same time, I rather be with-the-grain of the agentic coder's native systems for plans + todos, eg, alignment with the models & prompts. We've been doing this way b/c we find the native to be weaker than what these achieve, and to hard to add these kind of things to them.

RE:Other note, yes, we have 2 basic goals:

1. Louie to make graphs / graphistry easier. Especially when connected to operational databases (splunk, kusto, elastic, big query, ...). V1 was generating graphistry viz & GFQL queries. We're now working on louie inside of graphistry, for more dynamic control of the visual analysis environment ("filter to X and color Y as Z"), and as you say, to go straight to the answer too ("what's going on with account/topic X"). We spent years trying to bring jupyter notebooks etc to operational teams as a way to get graph insights to their various data, and while good for a few "data 1%'ers", too hard for most, and Louie has been a chance to rethink that.

2. Louie has been seeing wider market interest beyond graph, basically "AI that investigates" across those operational DBs (& live systems). You can think of it as vibe coding is code-oriented, while louie is vibe investigating that is more data-oriented. Ex: Native plans don't think in unit tests but cross-validation, and instead of grepping 1,000 files, we get back a dataframe of 1M query results and pass that between the agents for localized agentic retrieval on that vs rehammering db. The CCC talk gives a feel for this in the interactive setting.

jodleif

a month ago

For humans org-mode is good at this

homarp

a month ago

aren't the system prompt of Claude public in the doc at https://platform.claude.com/docs/en/release-notes/system-pro... ?

Stagnant

a month ago

The system prompt of claude code changes constantly. I use this site to see what has changed between versions: https://cchistory.mariozechner.at/

It is a bit weird why anthropic doesn't make that available more openly. Depending on your preferences there is stuff in the default system prompt that you may want to change.

I personally have a list of phrases that I patch out from the system prompt after each update by running sed on cc's main.js

handfuloflight

a month ago

What are those phrases? Why do you exclude them?

btown

a month ago

This is for Claude Code, not just Claude.

what

a month ago

From elsewhere in that prompt:

> Only use emojis if the user explicitly requests it. Avoid adding emojis to files unless asked.

When did they add this? Real shame because the abundance of emojis in a readme was a clear signal of slop.

rrvsh

a month ago

I've had a LOT of success keeping a "working memory" file for CLI agents. Currently testing out Codex now, and what I'll do is spend ~10mins hashing out the spec and splitting it into a list of changes, then telling the agent to save those changes to a file and keep that file updated as it works through them. The crucial part here is to tell it to review the plan and modify it if needed after every change. This keeps the LLM doing what it does best (short term goals with limited context) while removing the need to constantly prompt it. Essentially I feel like it's an alternative to having subagents for the same or a similar result

tacone

a month ago

I use a folder for each feature I add. The LLM is only allowed to output markdown file in the output subfolder (of course it doesn't always obey, but it still limits pollution in the main folder)

The folder will contain a plan file and a changelog. The LLM is asked to continously update the changelog.

When I open a new chat, I attach the folder and say: onboard yourself on this feature then get back to me.

This way, it has context on what has been done, the attempts it did (and perhaps failed), the current status and the chronological order of the changes (with the recent ones being usually considered more authoritative)

fastball

a month ago

Planning mode actually creates whole markdown files, then wipes the context that was required to create that plan before starting work. Then it holds the plan at the system prompt level to ensure it remains top of mind (and survives unaltered during context compaction).

ramoz

a month ago

I don’t think it wipes the context window.

matchagaucho

a month ago

The TODO lists are also frequently re-inserted into the context HEAD to keep the LLM aware of past and next steps.

And in the event of context compression, the TODO serves as a compact representation of the session.

dboon

a month ago

I’m a DIY (or, less generously and not altogether inaccurately, NIH) type who thinks he could do a good job of smarter context management. But, I have no particular reason to know better than anyone else. Tell me more. What have you seen? What kinds of approaches? Who’s working on it?

lmeyerov

a month ago

I'm optimistic most people can, given the time and resources

In the CCC video, you may enjoy the section on how we are moving to eval-driven AI coding for how we more methodically improve agents. Even more so, the slides before on motivating why it gets harder to improve quality as you go on.

One big rub is it's one of those areas where people grossly misunderestimate what is needed for the quality goals they're likely targeting, and if a long-living artifact to be maintained, the on-going costs. It's similar to junior engineers or short-term contractors who never had to build production-grade software before and haven't had to live with their decisions: These are quite learnable engineering skills, and I've found it useful to burn your fingers before having confidence in the surprising weight of cost/benefit decisions. The more autonomy and expectations you are targeting for the agent, the more so.

shnpln

a month ago

Oh yes, I commonly add something like "Use a very granular todo list for this task" at the end of my prompts. And sometimes I will say something like "as your last todo, go over everything you just did again and use a linter or other tools to verify your work is high quality"

hadlock

a month ago

Right now I start chatting with a separate LLM about the issue, the best structure for maintainability and then best libraries for the job, edge and corner cases, how to handle those, and then have it spit out a prompt and a checklist. If it has a UI I'll draw something in paint and refine it before having the LLM describe it in detail and primary workflow etc. and tell it to format it for an agent to use. That will usually get me a functional system on the first try which can then be iterated on.

That's for complicated stuff. For throw-away stuff I don't need to maintain past 30 days like a script I'll just roll the dice and let it rip.

shnpln

a month ago

Yeah, this is a good idea. I will have a Claude chat session and a Claude Code session open side by side too.

Like a manual sub agents approach. I try not to pollute the Claude code session context with meanderings to much. Do that in the chat and bring the condensed ideas over.

renjimen

a month ago

If you have pre-commit hooks it should do this last bit automatically, and use your project settings

shnpln

a month ago

Yes, I do. But it does not always use them when I change contexts. I just get in the habit of saying it. Belt and suspenders approach.

sathish316

a month ago

It’s surprising how simple TodoWrite and TodoRead tools are in planning and making sure an Agent follows the plan.

This is supposed to be an emulator of Claude’s own TodoWrite and TodoRead, which does a full update of a todo.json for every task update. A nice use of composition of edit tool - https://github.com/joehaddad2000/claude-todo-emulator

sathish316

a month ago

Complex planning and orchestration for Multi-step usecases or persistent Todo lists is achievable by spinning up your own tools that does something similar to this.

By extending Claude Todo emulator, It was possible to make the agent come up with Multi-step Hierarchical plans and follow it and track updates on it for usecases like Oncall Troubleshooting Runbooks.

PS: the above open source repo does not provide single task update as a tool, which is not hard to implement on your own

veselin

a month ago

I run evals and the Todo tool doesn't help most of the time. Usually models on high thinking would maintain Todo/state in their thinking tokens. What Todo helps is for cases like Anthropic models to run more parallel tool calls. If there is a Todo list call, then some of the actions after are more efficient.

What you need to do is to match the distribution of how the models were RL-ed. So you are right to say that "do X in 200 lines" is a very small part of the job to be done.

lmeyerov

a month ago

Curious what kinds of evals you focus on?

We're finding investigating to be same-but-different to coding. Probably the most close to ours that has a bigger evals community is AI SRE tasks.

Agreed wrt all these things being contextual. The LLM needs to decide whether to trigger tools like self-planning and todo lists, and as the talk gives examples of, which kind of strategies to use with them.

veselin

a month ago

I am taking for SWE bench style problems where Todo doesn't help, except for more parallelism.

lmeyerov

a month ago

Was guessing that, coding tasks are a valuable but myopic lense :)

I'm guessing a self-updating plan there is sufficient. I'm not actually convinced today's current plan <> todolist flow makes sense - in the linked PLAN.md, it gets unified, and that's how we do ai coding. I don't have evals on this, but from a year of vibes coding/engineering, that's what we experientially reached across frontier coding models & tools. Nowadays we're mixing in evals too, but that's a more complicated story.

jcims

a month ago

Mind if I ask what models you’re using for CTF? I got out of the game about ten years ago and have been recently thinking about doing my toes back in.

lmeyerov

a month ago

Yep -- one fun experiment early in the video is showing sonnet 4.5 -> opus 4.5 gave a 20% lift

We do a bit of model-per-task, like most calls are sending targeted & limited context fetches into faster higher-tier models (frontier but no heavy reasoning tokens), and occasional larger data dumps (logs/dataframes) sent into faster-and-cheaper models. Commercially, we're steering folks right now more to openai / azure openai models, but that's not at all inherent. OpenAI, Claude, and Gemini can all be made to perform well here using what the talk goes over.

Some of the discussion earlyish in the talk and Q&A after is on making OSS models production-grade for these kinds of investigation tasks. I find them fun to learn on and encourage homelab experiments, and for copilots, you can get mileage. For more heavy production efforts, I typically do not recommend them for most teams at this time for quality, speed, practicality, and budget reasons if they have the option to go with frontier models. However, some bigger shops are doing it, and I'd be happy to chat how we're approaching quality/speed/cost there (and we're looking for partners on making this easier for everyone!)

jcims

a month ago

Nice! Thank you!

I just did an experiment yesterday with Opus 4.5 just operating in agent mode in vscode copilot. Handed it a live STS session for AWS to see if it could help us troubleshoot an issue. It was pretty remarkable seeing it chop down the problem space and arrive at an accurate answer in just a few mins.

I'll definitely check out the video later. Thanks!

bdangubic

a month ago

at the end of the year than you get “How to Code Claude Code in 200 Million Lines of Code” :)

redanddead

a month ago

ironic how the peak of code is a TODO list app

libraryofbabel

a month ago

It's a great point and everyone should know it: the core of a coding agent is really simple, it's a loop with tool calling.

Having said that, I think if you're going to write an article like this and call it "The Emperor Has No Clothes: How to Code Claude Code in 200 Lines of Code", you should at least include a reference to Thorsten Ball's excellent article from wayyy back in April 2025 entitled "How to Build an Agent, or: The Emperor Has No Clothes" (https://ampcode.com/how-to-build-an-agent)! That was (as far as I know) the first of these articles making the point that the core of a coding agent is actually quite simple (and all the deep complexity is in the LLM). Reading it was a light-bulb moment for me.

FWIW, I agree with other commenters here that you do need quite a bit of additional scaffolding (like TODOs and much more) to make modern agents work well. And Claude Code itself is a fairly complex piece of software with a lot of settings, hooks, plugins, UI features, etc. Although I would add that once you have a minimal coding agent loop in place, you can get it to bootstrap its own code and add those things! That is a fun and slightly weird thing to try.

(By the way, the "January 2025" date on this article is clearly a typo for 2026, as Claude Code didn't exist a year ago and it includes use of the claude-sonnet-4-20250514 model from May.)

Edit: and if you're interested in diving deeper into what Claude Code itself is doing under the hood, a good tool to understand it is "claude-trace" (https://github.com/badlogic/lemmy/tree/main/apps/claude-trac...). You can use it to see the whole dance with tool calls and the LLM: every call out to the LLM and the LLM's responses, the LLM's tool call invocations and the responses from the agent to the LLM when tools run, etc. When Claude Skills came out I used this to confirm my guess about how they worked (they're a tool call with all the short skill descriptions stuffed into the tool description base prompt). Reading the base prompt is also interesting. (Among other things, they explicitly tell it not to use emoji, which tracks as when I wrote my own agent it was indeed very emoji-prone.)

bredren

a month ago

I've been exploring the internals of Claude Code and Codex via the transcripts they generate locally (these serve as the only record of your interactions with the products)[1].

Given the stance of the article, just the transcript formats reveals what might be a surprisingly complex system once you dig in.

For Claude Code, beyond the basic user/assistant loop, there's uuid/parentUuid threading for conversation chains, queue-operation records for handling messages sent during tool execution, file-history-snapshots at every file modification, and subagent sidechains (agent-*.jsonl files) when the Task tool spawns parallel workers.

So "200 lines" captures the concept but not the production reality of what is involved. It is particularly notable that Codex has yet to ship queuing, as that product is getting plenty of attention and still highly capable.

I have been building Contextify (https://contextify.sh), a macOS app that monitors Claude Code and Codex CLI transcripts in real-time and provides a CLI and skill called Total Recall to query your entire conversational history across both providers.

I'm about to release a Linux version and would love any feedback.

[1] With the exception of Claude Code Web, which does expose "sessions" or shared transcripts between local and hosted execution environments.

jake-coworker

a month ago

IMO these articles are akin to "Twitter in 200 lines of code!" and "Why does Uber need 1000 engineers?" type articles.

They're cool demos/POCs of real-world things, (and indeed are informative to people who haven't built AI tools). The very first version of Claude Code probably even looked a lot like this 200 line loop, but things have evolved significantly from there

tomtomtom777

a month ago

> IMO these articles are akin to "Twitter in 200 lines of code!"

I don't think it serves the same purpose. Many people understand the difference between a 200 lines twitter prototype and the real deal.

But many of those may not understand what the LLM client tool does and how it relates to the LLM server. It is generally consumed as one magic black box.

This post isn't to tell us how everyone can build a production grade claude-code; it tells us what part is done by the CLI and what part is done by the LLM's which I think is a rather important ingredient in understanding the tools we are using, and how to use them.

d4rkp4ttern

a month ago

Nice, I have something similar [1], a super-fast Rust/Tantivy-based full-text search across Claude Code + Codex-CLI session JSONL logs, with a TUI (for humans) and a CLI/JSONL mode for agents.

For example there’s a session-search skill and corresponding agent that can do:

    aichat search —json  [search params]

So you can ask Claude Code to use the searcher agent to recover arbitrary context of prior work from any of your sessions, and build on that work in a new session. This has enabled me to completely avoid compaction.

[1] https://github.com/pchalasani/claude-code-tools?tab=readme-o...

dnw

a month ago

That is a cool tool. Also one can set "cleanupPeriodDays": in ~/.claude/settings.json to extend cleanup. There is so much information these tools keep around we could use.

I came across this one the other day: https://github.com/kulesh/catsyphon

Johnny_Bonk

a month ago

This is very interesting, especially if you could then use an llm across that search to figure out what has and maybe has not been completed, and then reinject those findings into a new Claude code session

bredren

a month ago

I haven't written the entry yet but it is pretty incredible what you can get when letting a frontier model RAG your complete CLI convo history.

You can find out not just what you did and did not do but why. It is possible to identify unexpectedly incomplete work streams, build a histogram of the times of day you get most irritated with the AI, etc.

I think it is very cool and I have a major release coming. I'd be very appreciative of any feedback.

handfuloflight

a month ago

For that you'd be better off having the LLM write TODO stubs in the codebase and search for that. In fact, most of the recent models just do this, even without prompting.

lelanthran

a month ago

> So "200 lines" captures the concept but not the production reality of what is involved.

How many lines would you estimate it takes to capture that production reality of something like CC? I ask because I got downvoted for asking that question on a different story[1].

I asked because in that thread someone quoted the CC dev(s) as saying:

>> In the last thirty days, I landed 259 PRs -- 497 commits, 40k lines added, 38k lines removed.

My feeling is that a tool like this, while it won't be 200 lines, can't really be 40k lines either.

[1] If anyone is interested, https://news.ycombinator.com/item?id=46533132

foltik

a month ago

My guess is <5k for a coherent and intentional expert human design. Certainly <10k.

It’s telling that they can’t fix the screen flickering issue, claiming “the problem goes deep.”

misternugget

a month ago

Hey! Thorsten Ball here. Thanks for the shout-out. I was quite confused when someone sent me this article: same "Emperor has no clothes", same "it's only x hundred lines", implements the same tools, it even uses the same ASCII colors when printing you/assistant/tool. Then I saw the "January 2025" in the title and got even more confused.

So, thanks for your comment and answering all the questions I had just now about "wait, did I wake up in a parallel universe where I didn't write the post but someone else did?"

libraryofbabel

a month ago

Hi! Thanks again for writing that first Emperor Has No Clothes blog post; like I said, it really inspired me and made everything click early on when I was first dipping my toes into the world of agents. Whenever I teach this stuff to other engineers, I often think back to that moment of realizing exactly how the exchange between LLM, tool call requests, tool functions, and agent code works and I try to get that across as the central takeaway. These days I usually add diagrams to get really clear on what happens on which side of the model api.

I do wonder whether the path here was:

1) You wrote the article in April 2025

2) The next generation of LLMs trained on your article

3) The author of TFA had a similar idea, and heavily used LLMs to help write the code and the article, including asking the LLM to think of a catchy title. And guess what title the LLM comes up with?

There are also less charitable interpretations, of course. But I'd rather assume this was honestly, if sloppily, done.

justanotherprof

a month ago

Many thanks for your article, it was one of the true "aha" moments for me in 2025! It is a shame that your work is apparently being appropriated without attribution to sell an online course...

aszen

a month ago

The most imp part is editing code, to do that reliably Claude models are trained on their own str replace tool schema I think. Models find it hard to modify existing code, they also can't just rewrite whole files bcz that's expensive and doesn't scale.

embedding-shape

a month ago

Here's where I was hoping openly available models would shine. Some community gets together, starts sharing successful/failed runs with their own agent, start building a open dataset for their specific syntax and tooling. then finally finetune new variants with it for the community.

libraryofbabel

a month ago

Yeah, there is definitely some RLVR training going on for the Claude LLMs to get them good at some of the specific tool calls used in Claude Code, I expect. Having said that, the string replacement tool schema for file edits is not very complicated at all (you can see it in the tool call schema Claude Code sends to the LLM), so you could easily use that in your own 200-300 line agent if you wanted to make sure you're playing to the LLM's strengths.

aszen

a month ago

Yeah that's one example, but I suspect they train the model on entire sequences of tool calls, so unless you prompt the model exactly as them you won't get the same results.

There's a reason they won the agent race, their models are trained to use their own tools.

libraryofbabel

a month ago

Agree, the RLVR tasks are probably long series of tool calls at this point doing complex tasks in some simulated dev environment.

That said, I think it's hard to say how much of a difference it really makes in terms of making Claude Code specifically better than other coding agents using the same LLM (versus just making the LLM better for all coding agents using roughly similar tools). There is probably some difference, but you'd need to run a lot of benchmarks to find out.

aszen

a month ago

Agreed it probably contributes to the model improving for all agents but crucially it is verifiably better against their own agent. So they get a good feedback loop to improve both

alansaber

a month ago

Ah I just assumed it was the same article reposted

justanotherprof

a month ago

I am glad you pointed out Thorsten Ball's truly excellent article: I was about to add a comment to that effect!

KellyCriterion

a month ago

can you show us the >>core of a coding agent which is, according to your words, >>really simple and would you mind sharing a URL so I could check it out then?

libraryofbabel

a month ago

It's in TFA or in the https://ampcode.com/how-to-build-an-agent article I linked? Or is that not what you're looking for?

KellyCriterion

a month ago

Sorry, sounded like "your version", instead of the one listed :-)

joshmlewis

a month ago

This is cool but as someone that's built an enterprise grade agentic loop in-house that's processing a billion plus tokens a month, there are so many little things you have to account for that greatly magnify complexity in real world agentic use cases. For loops are an easy way to get your foot in the door and is indeed at the heart of it all, but there are a multitude of a little things that compound complexity rather quickly. What happens when a user sends a message after the first one and the agent has already started the tool loop? Seems simple, right? If you are receiving inputs via webhooks (like from a Slack bot), then what do you do? It's not rocket science but it's also not trivial to do right. What about hooks (guardrails) and approvals? Should you halt execution mid-loop and wait or implement it as an async Task feature like Claude Code and the MCP spec? If you do it async then how do you wake the agent back up? Where is the original tool call stored and how is the output stored for retrieval/insertion? This and many other little things add up and compound on each other.

I should start a blog with my experience from all of this.

visarga

a month ago

A quick glance over the 200 LOC impl - I see no error handling. This is the core of the agent loop, you need to pass some errors back to the LLM to adapt, while other errors should be handled by the code. There is also no model specific code for structured decoding.

This article could make for a good interview problem, "what is missing and what would you improve on it?"

joebates

a month ago

Please do! This sounds way more interesting than a simple coding loop agent (not to knock the blog)

handfuloflight

a month ago

> I should start a blog with my experience from all of this.

Please do.

jacob019

a month ago

Seems everyone is working on the same things these days. I built a persistent Python REPL subprocess as an MCP tool for CC, it worked so insanely well that I decided to go all the way. I already had an agentic framework built around tool calling (agentlib), so I adapted it for this new paradigm and code agent was born.

The agent "boots up" inside the REPL. Here's the beginning of the system prompt:

  >>> help(assistant)

  You are an interactive coding assistant operating within a Python REPL.
  Your responses ARE Python code—no markdown blocks, no prose preamble.
  The code you write is executed directly.

  >>> how_this_works()

  1. You write Python code as your response
  2. The code executes in a persistent REPL environment
  3. Output is shown back to you IN YOUR NEXT TURN
  4. Call `respond(text)` ...

You get the idea. No need for custom file editing tools--Python has all that built in and Claude knows it perfectly. No JSON marshaling or schema overhead. Tools are just Python functions injected into the REPL, zero context bloat.

I also built a browser control plugin that puts Claude directly into the heart of a live browser session. It can inject element pickers so I can click around and show it what I'm talking about. It can render prototype code before committing to disk, killing the annoying build-fix loop. I can even SSH in from my phone and use TTS instead of typing, surprisingly great for frontend design work. Knocked out a website for my father-in-law's law firm (gresksingleton.com) in a few hours that would've taken 10X that a couple years ago, and it was super fun.

The big win: complexity. CC has been a disaster on my bookkeeping system, there's a threshold past which Claude loses the forest for the trees and makes the same mistakes over and over. Code agent pushes that bar out significantly. Claude can build new tools on the fly when it needs them. Gemini works great too (larger context).

Have fun out there! /end-rant

giancarlostoro

a month ago

> there's a threshold past which Claude loses the forest for the trees and makes the same mistakes over and over.

Try using something like Beads with Claude Code. Also don't forget to have a .claude/instructions.md file, you can literally ask Claude to make it for you, this is the file Claude reads every time you make a new prompt. If Claude starts acting "off" you tell it to reread it again. With Beads though, I basically tell Claude to always use it when I pitch anything, and to always test that things build after it thinks its done, and to ask me to confirm before closing a task (They're called beads but Claude figures what I mean).

With Beads the key thing I do though once all that stuff is setup is I give it my ideas, it makes a simple ticket or tickets. Then I both braindump and / or ask it to do market research on each item and the parameters I want to be considered, and then to update the task accordingly. I then review them all. Then I can go "work on these in parallel" and it spins up as many agents as there are tasks, and goes to work. I don't always make it do the work in parallel if its a lot of tasks because Zed has frozen up on me, I read that Claude Code is fine its just the protocol that Zed uses that gets hosed up.

I find that with Beads, because I refine the tickets with Claude, it does way better at tasks this way. It's like if you're hand crafting the "perfect" prompt, but the AI model is writing it, so it will know exactly what you mean because it wrote it in its own verbiage.

jacob019

a month ago

Ohh, looks interesting, thanks for the tip!

Git as db is clever, and the sqlite cache is nice. I'd been sketching sqlite based memory features myself. So much of the current ecosystem is suboptimal just because it's new. The models are trained around immutable conversation ledgers with user/tool/assistant blocks, but there are compelling reasons to manipulate both sides at runtime and add synthetic exchanges. Priming with a synthetic failure and recovery is often more effective than a larger, more explicit system message. Same with memory, we just haven't figured out what works best.

For code agents specifically, I found myself wanting old style completion models without the structured turn training--doesn't exist for frontier models. Talking to Claude about its understanding of the input token stream was fascinating; it compared it to my visual cortex and said it'd be unreasonable to ask me to comment on raw optic nerve data.

"Tell it to reread it again"-exactly the problem. My bookkeeping/decision engine has so much documentation I'm running out of context, and every bit is necessary due to interconnected dependencies. Telling it to reread content already in the window feels wrong, that's when I refine the docs or adjust the framework. I've also found myself adding way more docstrings and inline docs than I'd ever write for myself. I prefer minimal, self-documenting code, so it's a learning process.

There's definitely an art to prompts, and it varies wildly by model, completely different edge cases across the leading ones. Thanks again for the tip; I suspect we'll see a lot of interesting memory developments this year.

giancarlostoro

a month ago

I am working on a strict sqlite alternative to Beads as an experiment so far it works.

freehorse

a month ago

This sounds really cool! I love the idea behind it, the agent having persistent access to a repl session, as I like repl-based workflows in general. Do you have any code public from this by any chance?

jacob019

a month ago

https://github.com/jacobsparts/agentlib

See CodeAgent or subrepl.py if you're just interested in the REPL orchestration. I also have a Python REPL MCP server that works with CC. It isn't published, but I could share it by request.

My favorite part of code agent is the /repl command. I can drop into the REPL mid session and load modules, poke around with APIs and data, or just point Claude in the right direction. Sometimes a snippet of code is worth 1000 words.

dunk010

a month ago

I request you to publish it! :-D

jacob019

16 days ago

I haven't had time to make a repo, so I'll just drop this here for you: https://jacobstoner.com/subrepl_mcp.py

Just make sure you have the deps installed and add it to CC as a stdio mcp server. Tested on Linux only. I use it daily.

jacob019

a month ago

Okay, I will :)

I'll add a new repo for it, maybe this evening, and I'll reply here with a link.

jacob019

16 days ago

freehorse

a month ago

Thanks!

bschmidt241

a month ago

[dead]

nyellin

a month ago

There's a bit more to it!

For example, the agent in the post will demonstrate 'early stopping' where it finishes before the task is really done. You'd think you can solve this with reasoning models, but it doesn't actually work on SOTA models.

To fix 'early stopping' you need extra features in the agent harness. Claude Code does this with TODOs that are injected back into every prompt to remind the LLM what tasks remain open. (If you're curious somewhere in the public repo for HolmesGPT we have benchamrks with all the experiments we ran to solve this - from hypothesis tracking to other exotic approaches - but TODOs always performed best.)

Still, good article. Agents really are just tools in a loop. It's not rocket science.

d4rkp4ttern

a month ago

Yes this “premature termination”, becomes particularly evident when you switch out Opus/Sonnet with a weaker LLM, and also happens more often in Codex CLI with GPT-5.

Since one of the replies asked for an example: the agent works for a bit and just stops. We’ve all seen cases where the agent simply says “ok, let me read the blah.py to understand the context better”, and just stops. It has essentially forgotten to use a tool for its next edit or read etc.

rtgfhyuj

a month ago

why would it early stop? examples?

mickeyp

a month ago

Models just naturally arrive at a conclusion that they are done. TODO hints can help, but is not infallible: Claude will stop and happily report there's more work to be done and "you just say the word Mister and I'll continue" --- this is a RL problem where you have to balance the chance of an infinite loop (it keeps thinking there's a little bit more to do when there is not) versus the opposite where it stops short of actual completion.

wxce

a month ago

> this is a RL problem where you have to balance the chance of an infinite loop (it keeps thinking there's a little bit more to do when there is not) versus the opposite where it stops short of actual completion.

Any idea on why the other end of the spectrum is this way -- thinking that it always has something to do?

I can think of a pet theory on it stopping early -- that positive tool responses and such bias it towards thinking it's complete (could be extremely wrong)

skybrian

a month ago

My pet theory: LLM's are good at detecting and continuing patterns. Repeating the same thing is a rather simple pattern, and there's no obvious place to stop if an LLM falls into that pattern unintentionally. At least to an unsophisticated LLM, the most likely completion is to continue the pattern.

So infinite loops are more of a default, and the question is how to avoid them. Picking randomly (non-zero temperature) helps prevent repetition sometimes. Other higher-level patterns probably prevent this from happening most of the time in more sophisticated LLM's.

yencabulator

a month ago

> Any idea on why the other end of the spectrum is this way -- thinking that it always has something to do?

Who said anything about "thinking"? Smaller models were notorious for getting stuck repeating a single word over and over, or just "eeeeeee" forever. Larger models only change probabilities, not the fundamental nature of the machine.

embedding-shape

a month ago

Not all models are trained with long one-shot task following by themselves, seems many of them prefer closer interactions with the user. You could always add another layer/abstraction above/below to work around it.

fastball

a month ago

Can't this just be a Ralph Wiggum loop (i.e. while True)

embedding-shape

a month ago

Sure, but I think just about everyone wants the agent to eventually say "done" in one way or another.

prodigycorp

a month ago

This article was more true than not a year ago but now the harnesses are so far past the simple agent loop that I'd argue that this is not even close to an accurate mental model of what claude code is doing.

qsort

a month ago

Obviously modern harnesses have better features but I wouldn't say it invalidates the mental model. Simpler agents aren't that far behind in performance if the underlying model is the same, including very minimal ones with basic tools.

I'd say it's similar to how a "make your own relational DB" article might feature a basic B-tree with merge-joins. Yeah, obviously real engines have sophisticated planners, multiple join methods, bloom filters, etc., but the underlying mental model is still accurate.

prodigycorp

a month ago

You’re not wrong but I still think that the harness matters a lot when trying to accurately describe Claude Code.

Here’s a reframing:

If you asked people “what would you rather work with, today’s Claude Code harness with sonnet 3.7, or the 200 line agentic loop in the article with Opus 4.5, which would you choose?”

I suspect many people would choose 3.7 with the harness. Moreover, that is true, then I’d say the article is no longer useful for a modern understanding of Claude Code.

aszen

a month ago

I don't think so, model improvements far outweigh any harness or tooling.

Look at https://github.com/SWE-agent/mini-swe-agent for proof

prodigycorp

a month ago

Yes but people aren’t choosing CC because they are necessarily performance maximalists. They choose it because it has features that make it behave much more nicely as a pair programming assistant than mini-swe-agent.

There’s a reason Cursor poached Boris Cherney and Cat Wu and Anthropic hired them back!

aszen

a month ago

They nailed down the UX I would say and the models themselves are a lot better even outside of CC

prodigycorp

a month ago

I don’t think I disagree with you about anything, I’m trying to split hairs at this point.

rfw300

a month ago

Any person who would choose 3.7 with a fancy harness has a very poor memory about how dramatically the model capabilities have improved between then and now.

prodigycorp

a month ago

I’d be very interested in the performance of 3.7 decked out with web search, context7, a full suite of skills, and code quality hooks against opus 4.5 with none of those. I suspect it’s closer than you think!

CuriouslyC

a month ago

Skills don't make any difference above having markdown files to point an agent to with instructions as needed. Context7 isn't any better than telling your agent to use trafilatura to scrape web docs for your libs, and having a linting/static analysis suite isn't a harness thing.

3.7 was kinda dumb, it was good at vibe UIs but really bad at a lot of things and it would lie and hack rewards a LOT. The difference with Opus 4.5 is that when you go off the Claude happy path, it holds together pretty well. With Sonnet (particularly <=4) if you went off the happy path things got bad in a hurry.

prodigycorp

a month ago

Yeah. 3.7 was pretty bad. I remember its warts vividly. It wanted to refactor everything. Not a great model on which to hinge this provocation.

But skills do improve model performance, OpenAI posted some examples of how it massively juiced up their results on some benchmarks.

nl

a month ago

> I suspect it’s closer than you think!

It's not.

I've done this (although not with all these tools).

For a reasonable sized project it's easy to tell the difference in quality between say Grok-4.1-Fast (30 on AA Coding Index) and Sonnet 4.5 (37 on AA).

Sonnet 3.7 scores 27. No way I'm touching that.

Opus 4.5 scores 46 and it's easy to see that difference. Give the models something with high cyclomtric complexity or complex dependency chains and Grok-4.1-Fast falls to bits, Opus 4.5 solves things.

nl

a month ago

This is SO wrong.

I actually wrote my own simple agent (with some twists) in part so I could compare models.

Opus 4.5 is in a completely different league to Sonnet 4.5, and 3.7 isn't even on the same planet.

I happily use my agent with Opus but there is no world in which I'd use a Sonnet 3.7 level model for anything beyond simple code completion.

alright2565

a month ago

But does that extra complexity actually improve performance?

https://www.tbench.ai/leaderboard/terminal-bench/2.0 says yes, but not as much as you'd think. "Terminus" is basically just a tmux session and LLM in a loop.

prodigycorp

a month ago

I'm not a good representative for claude code because I'm primarily a codex user now, but I know that if codex had subagents it would be at least twice as productive. Time spent is an important aspect of performance so yup, the complexity improved performance.

nyellin

a month ago

Not necessarily true. Subagents allow for parallelization but they can decrease accuracy dramatically if you're not careful because there are often dependencies between tasks and swapping context windows with a summary is extremely lossy.

For the longest time, Claude Code itself didnt really use subagents much by default, other than supporting them as a feature eager users could configure. (Source is reverse engineering we did on Claude code using the fantastic CC tracing tool Simon Willison wrote about once. This is also no longer true on latest versions that have e.g. an Explore subagent that is actively used.)

prodigycorp

a month ago

You’re right that subagents were more likely to cause issues than be helpful. But, when properly understood, lead to so much time saved through parallelization for tasks that warranted it.

I was having codex organize my tv/movie library the other day by having it generate. most of the files were not properly labeled. I had codex generate transcripts, manually search the movie db to find descriptions of show episodes, and match the show descriptions against the transcripts to figure out which episode/season the show belonged to.

Claude Code could have parallelized those manual checks and finished that task at 8x the speed.

terminalshort

a month ago

Are subagents a fundamental change, or just acting as inner loops to the agentic loop similar to the one in the article?

steveklabnik

a month ago

Subagents, in my understanding, are just a tool call from the perspective of the parent.

lukan

a month ago

The article was also published one year ago on january 2025.

(Should have 2025 in the title? Time flies)

llmslave2

a month ago

Claude Code didn't exist in January 2025. I think it's a typo and should be 2026.

prodigycorp

a month ago

You’re right. No wonder the date felt odd. iirc Claude code was released around march.

dkdcio

a month ago

late Feb. 2025: https://www.anthropic.com/news/claude-3-7-sonnet

CuriouslyC

a month ago

Less true than you think. A lot of the progress in the last year has been tightening agentic prompts/tools and getting out of the way so the model can flex. Subagents/MCP/Skills are all pretty mid, and while there has been some context pruning optimization to avoid carrying tool output along forever, that's mainly a benefit to long running agents and for short tasks you won't notice.

prodigycorp

a month ago

All of these things you mentioned are put into a footnote of the article.

dkdcio

a month ago

it seems to have changed a ton in recent versions too — I would love more details on what exactly

I find it doing what I in the past had to interrupt and tell it to do fairly frequently now

terminalshort

a month ago

For one thing it seems to splitting up the work and making some determination of complexity, then allocating it out to a model based on that complexity to save resources. When I run Claude with Opus 4.5 and run /cost I see tokens for Opus 4.5, but also a lot in Sonnet and Haiku, with the majority of tokens actually being used by Haiku.

nyellin

a month ago

Haiku is called often, but not always the way you think. E.g. every time you write something CC invokes Haiku multiple times to generate the 'delightful 1-2 word phrase used to indicate progress to the user' (Doing Stuff, Wizarding, etc)

dkdcio

a month ago

it’s also used in the Explore agent and for other things too

pama

a month ago

Agreed. You can get a better model using the codex-cli repo and having an agent help you analyze the core functionality.

splike

a month ago

I'm interested, could you expand on that?

prodigycorp

a month ago

Off the top of my head: parallel subagents, hooks, skills, and a much better plan mode. These features enable way better steering than we had last year. Subagents are a huge boon to productivity.

rtgfhyuj

a month ago

are subagents just tools that are agents themselves?

dkdcio

a month ago

pretty much…they have their own system prompts, you can customize the model, the tools they use, etc.

CC has built in subagents (including at least one not listed) that work very well: https://code.claude.com/docs/en/sub-agents#built-in-subagent...

this was not the case in the past, I swore off subagents, but they got good at some point

floppyd

a month ago

> This is the key insight: we’re just telling the LLM “here are your tools, here’s the format to call them.” The LLM figures out when and how to use them.

This really blew my mind back then in the ancient times of 2024-ish. I remember the idea of agents just reached me and I started reading various "here I built an agent that does this" articles, and I was really frustrated at not understanding how the hell LLM "knows" how to call a tool, it's a program, but LLMs just produce text! Yes I see you are telling LLM about tools, but what's next? And then when I finally understood that there's no next, no need to do anything other than explaining — it felt pretty magical, not gonna lie.

utopiah

a month ago

Tools documentation is in text, either directly e.g. tool -h or indirectly e.g. man tool plus they are countless examples of usage online, so it seems clear that a textual mapping between the tool and its usage exists in text form already.

voidhorse

a month ago

Running a command in a shell is a string of text. LLMs produce text and it's easy to write a program that then executes that text as a process. I don't see what's magical about it at all.

naasking

a month ago

It would seem magical if you think of LLMs as token predictors with zero understanding of what they're doing. This is evidence that there's more going on though.

ofirpress

a month ago

We (the SWE-bench team) have a 100 line of code agent that is now pretty popular in both academic and industry labs: https://github.com/SWE-agent/mini-swe-agent

I think it's a great way to dive into the agent world

miki123211

a month ago

All you actually need is 50 lines and one tool.

If your agent can execute Bash commands, it can do anything, including reading files (with cat), writing them (with sed / patch / awk /perl), grepping, finding, and everything else you may possibly need. The specialized tools are just an optimization to make things easier for the agent. They do increase performance (in the "how much can this do", not the "how fast is this" sense), but they're not strictly required.

IMHO, this is one of the more significant LLM-related discoveries of 2025. You don't need a context-polluting Github MCP that takes 10+% of your precious context window, all you need is the gh cli, which the agent already knows how to use.

sams99

a month ago

For those interested, edit is a surprisingly difficult problem, it seems easy on the surface but there is both fine tuning and real world hallucinations you are fighting with. I implemented one this week in:

https://github.com/samsaffron/term-llm

It is about my 10th attempt at the problem so I am aware of a lot of the edge cases, a very interesting bit of research here is:

https://gist.github.com/SamSaffron/5ff5f900645a11ef4ed6c87f2...

Fascinating read.

ulaw

a month ago

How many Claudes could Claude Code code if Claude Code could code Claude?

Okkef

a month ago

Claude has a nice answer to your riddle:

Claude Code could code all the Claudes Claude Code could code, because Claude Code already coded the Claude that codes Claude Code.

Or more philosophically: The answer is recursively infinite, because each Claude that gets coded can then help code the next Claude, creating an ever-improving feedback loop of Claude-coding Claudes. It's Claudes all the way down!

tuhgdetzhh

a month ago

"if Claude Code could code Claude?"

Claude Code already codes Claude Code.

The limit is set by the amount of GPUs and energy supply.

handfuloflight

a month ago

The human element of the Anthropic organization also has some limits placed there.

user

a month ago

[deleted]

dmvaldman

a month ago

This misses that agentic LLMs are trained via RL to use specific tools. Adding custom tools is subpar to those the model has been trained with. That's why Claude Code has an advantage, over say, Cursor, by being vertically integrated.

firloop

a month ago

But if one were to write tools that were "abi-compatible" with Claude Code's, could you see similar performance with a custom agent? And if so - is Cursor not doing just that?

yencabulator

a month ago

Part of that is likely that Claude Code tools keep changing a little. Imitating them is chasing a moving target.

mikmoila

a month ago

Are they really? I've been under impression that agentic LLMs are just instances of the LLMs, no "specialized training" involved

m-hodges

a month ago

Also relevant: You Should Write An Agent¹ and, How To Build An Agent.²

¹ https://fly.io/blog/everyone-write-an-agent/

² https://ampcode.com/how-to-build-an-agent

mirzap

a month ago

The "200 lines" loop is a good demo of the shape of a coding agent, but it’s like "a DB is a B-tree" - technically true, operationally incomplete.

The hard part isn’t the loop - it’s the boring scaffolding that prevents early stopping, keeps state, handles errors, and makes edits/context reliable across messy real projects.

tptacek

a month ago

What's interesting to me about the question of whether you could realistically compete with Claude Code (not Claude, but the CLI agent) is that the questions boil down to things any proficient developer could do. No matter how much I'd want to try, I have no hope of building a competitive frontier model --- "frontier model" is a distinctively apt term. But there's no such thing as a "frontier agent", and the Charmbracelet people have as much of a shot at building something truly exception as Anthropic does.

libraryofbabel

a month ago

This is a great point, although I would add that Anthropic has a possible slight advantage, as they can RLVR the Claude LLMs themselves on Claude Code tool calls and Claude Code tasks. Having said that, it's not clear how much that really matters at all for making the Claude Code CLI specifically better-performing than other coding agents using the same LLM (the tool calls are fairly generic and the LLMs are good at plenty of tool calls they weren't RLVRed on too).

The other advantage Anthropic have is just that they can sell CC subscriptions at lower cost because they own the models. But that's a separate set of questions that don't really relate to technical capabilities.

Anyhow, to follow up on your point, I do find it surprising that Claude Code is still (it seems?) definitively leading the pack in terms of coding agents. I've tried Gemini CLI and Codex and they feel distinctly less good, but I'm surprised we haven't seen too many alternatives from small startups or open source projects rise to the top as well. After all, they can build on all the lessons learned from previous agents (UX, context management, features people like such as Skills etc.). Maybe we will see more of this in 2026.

NitpickLawyer

a month ago

> the Charmbracelet people have as much of a shot at building something truly exception as Anthropic does.

Yes and no. OpenCode is a great example of yes. But at the same time Anthropic gets to develop both client and model together. THey get to use the signals from the client, and "bake in" some of the things into the model. So their model will work best with their client. And somewhat less competent with other clients (you can kinda sorta see that today with opus in cc vs. in cursor).

For example, cc was (to my knowledge) the first client to add <system_reminder> tags from time to time. How often, how the model used them and so on? That's basically "signals", and they work together. And it works beautifully, as cc seems to stay on task better than OpenCode while using the same model.

tptacek

a month ago

Anthropic has obvious advantages and I'm not saying there's a level playing field (they also have the financial resources of a mid-sized industrialized nation). I'm saying that there's an absolute limit to how much work you could personally do on a frontier model, and that limit doesn't exist for agents; you could realistically clever your way out ahead of Claude Code --- who knows? We've only had these things working for real for a year.

embedding-shape

a month ago

> No matter how much I'd want to try, I have no hope of building a competitive frontier model

A single person, probably not. But a group of dedicated FOSS developers who together build a wide community contributing to one open model that could be continuously upgraded? Maybe.

maurycy

a month ago

Maybe not necessarily and the Claude model is fine-tuned for `claude`, so no one can really replicate the experience without unlocking some secret mode in the model. The other comments about editing files hint at this.

vinhnx

a month ago

This reminds me of Amp's article last year[1]. I building my own coding agent [2]. Two goals: understand real-world agent mechanics and validate patterns I'd observed across OpenAI Codex and contemporary agents.

The core loop is straightforward: LLM + system prompt + tool calls. The differentiator is the harness, CLI, IDE extension, sandbox policies, filesystem ops (grep/sed/find). But what separates effective agents from the rest is context engineering. Anthropic and Manus has published various research articles around this topic.

After building vtcode, my takeaway: agent quality reduces to two factors, context management strategy and model capability. Architecture varies by harness, but these fundamentals remain constant.

[1] https://ampcode.com/how-to-build-an-agent [2] https://github.com/vinhnx/vtcode [3] https://www.anthropic.com/engineering/building-effective-age...

afarah1

a month ago

Reminds me of this 2023 post "re-implementing LangChain in 100 lines of code": https://blog.scottlogic.com/2023/05/04/langchain-mini.html

We did just that back then and it worked great, we used it in many projects after that.

prodigycorp

a month ago

How was this three years ago ;_;

armcat

a month ago

The new mental model actually is (1) skills based model, i.e. https://agentskills.io/home, and (2) where the LLM agents "see all problems as coding problems". Skills are a bunch of detailed Markdowns and corresponding code libraries and snippets. The mental model thereby loops as follows: read only the top level descriptions in each SKILL.md, use those in-context to decide which skill to pick, after picking the relevant skill read the skill in-depth to choose which code/lib to use, based on the <problem, code/lib> generate new code, execute the code, evaluate, repeat. The problem-as-a-code mental model is also a great way of evaluating, and creating rewards and guarantees.

thiagowfx

a month ago

The blog post starts with:

> I’m using OpenAI here, but this works with any LLM provider

Have you noticed there’s no OpenAI in the post?

RagnarD

a month ago

This feels like a pretty deceptive article title. At the end, he does say:

"What We Built vs. Production Tools This is about 200 lines. Production tools like Claude Code add:

Better error handling and fallback behaviors Streaming responses for better UX Smarter context management (summarizing long files, etc.) More tools (run commands, search codebase, etc.) Approval workflows for destructive operations

But the core loop? It’s exactly what we built here. The LLM decides what to do, your code executes it, results flow back. That’s the whole architecture."

But where's the actual test cases of the performance of his little bit of code vs. Claude Code? Is the core of Claude Code really just what he wrote (he boldly asserts 'exactly what we built here')? Where's the empirical proof?

johnsmith1840

a month ago

Here's the bigger question. Why would you?

Claude code feels like the first commodity agent. In theory its simple but in practice you'll have to maintain a ton of random crap you get no value in maintaining.

My guess is eventually all "agents" will be wipped out by claude code or something equivalent.

Maybe not the companies will die but that all those startups will just be hooking up a generic agent wrapper and let it do its thing directly. My bet is that that the company that would win this is the one with the most training data to tune their agent to use their harness correctly.

utopiah

a month ago

Because you don't trust Anthropic or you like to learn how the tools you rely on work?

hazrmard

a month ago

This reflects my experience. Yet, I feel that getting reliability out of LLM calls with a while-loop harness is elusive.

For example

- how can I reliably have a decision block to end the loop (or keep it running)?

- how can I reliably call tools with the right schema?

- how can I reliably summarize context / excise noise from the conversation?

Perhaps, as the models get better, they'll approach some threshold where my worries just go away. However, I can't quantify that threshold myself and that leaves a cloud of uncertainty hanging over any agentic loops I build.

Perhaps I should accept that it's a feature and not a bug? :)

nyellin

a month ago

Forgot to address the easiest part:

> - how can I reliably call tools with the right schema?

This is typically done by enabling strict mode for tool calling which is a hermetic solution. Makes llm unable to generate tokens that would violate the schema. (I.e. LLM samples tokens only from the subset of tokens that lead to valid schema generation.)

nyellin

a month ago

Re (1) use a TODOs system like Claude code.

Re (2) also fairly easy! It's just a summarization prompt. E.g. this is the one we use in our agent: https://github.com/HolmesGPT/holmesgpt/blob/62c3898e4efae69b...

Or just use the Claude Code SDK that does this all for you! (You can also use various provider-specific features for 2 like automatic compaction on OpenAI responses endpoint.)

schmuhblaster

a month ago

As an experiment over the holidays I had Opus create a coding agent in a Prolog DSL (more than 200 lines though) [0] and I was surprised how well the agent worked out of the box. So I guess that the latest one or two generations of models have reached a stage where the agent harness around the model seems to be less important than before.

[0] https://news.ycombinator.com/item?id=46527722

sathish316

a month ago

Excellent article on the internals of coding CLIs.

I learned a similarly powerful way to build DIY coding CLIs from this Martin Fowler post, which uses PydanticAI and MCP-based tools: https://martinfowler.com/articles/build-own-coding-agent.htm...

Once you understand the underlying LLM tool-calling protocols described here—and how MCP tool calls work (they’re conceptually very similar)—most coding CLIs stop feeling like magic. Anthropic’s own deep dive on MCP was especially useful for me in seeing how to integrate this into a DIY “Claude Code”-style CLI, and even adapt the same approach for non-coding agents as well: https://www.deeplearning.ai/short-courses/mcp-build-rich-con...

bjacobso

a month ago

https://www.youtube.com/watch?v=aueu9lm2ubo

vrosas

a month ago

Unless there's context, I'm never clicking on a naked youtube link.

pests

a month ago

Are you worried google is going to hack you or something?

handfuloflight

a month ago

He was told they were never gonna give up.

jackfranklyn

a month ago

The benchmark point is interesting but I think it undersells what the complexity buys you in practice. Yes, a minimal loop can score similarly on standardised tasks - but real development work has this annoying property of requiring you to hold context across many files, remember what you already tried, and recover gracefully when a path doesn't work out.

The TODO injection nyellin mentions is a good example. It's not sophisticated ML - it's bookkeeping. But without it, the agent will confidently declare victory three steps into a ten-step task. Same with subagents - they're not magic, they're just a way to keep working memory from getting polluted when you need to go investigate something.

The 200-line version captures the loop. The production version captures the paperwork around the loop. That paperwork is boring but turns out to be load-bearing.

dfajgljsldkjag

a month ago

[flagged]

rd

a month ago

Anyone who disagrees with this, please check the OP's previous comments. That's all the proof you need.

And then, as an exercise, ask yourself why you were willing to give this comment leniency?

FrontierProject

a month ago

This site has gone full Tower of Babel. I've seen at least a thousand "AI comment" callouts on this site in the last month and at this point I'm pretty sure 99% of them are wrong.

In fact, can someone link me to a disputed comment that the consensus ends up being it's actually AI? I don't think I've seen one.

NitpickLawyer

a month ago

You know how the chicken sexers do their thing, but can't explain it? Like they can't write a list of things they check for. And when they want to train new people they have them watch (apprentice style) the current ones, and eventually they also become good at doing it themselves?

It's basically that. I can't explain it (I tried listing the tells in a comment below), but it's not just a list of things you notice. You notice the whole message, the cadence, the phrases that "add nothing". You play with enough models, you see enough generations and you start to "see it".

If you'd like to check for yourself, check that user's comment history. It will become apparent after a few messages. They all have these tells. I don't know how else to explain it, but it's there.

matsemann

a month ago

> You know how the chicken sexers

That's certainly a novel and confusing entry in my search history.

lpellis

a month ago

I think this might be one of the first times I didnt notice it, but just look through the comment history of https://news.ycombinator.com/threads?id=jackfranklyn , they all look the same.

FrontierProject

a month ago

Yeah on a second look GP might actually be on to something here. Jackfranklyn only makes top level comments, never dialogs with anyone, and I count at least 3 instances of "as someone who does this for a living" that are too seperated in scope to be plausibly realistic.

tripdout

a month ago

This article reads like AI

dragonwriter

a month ago

“Comment I don't like is a bot” is the new “Comment I don’t like is a product of the HN hivemind conspiracy”.

dfajgljsldkjag

a month ago

The comment isn't saying anything controversial so why would I dislike it or want an excuse to throw shade on it?

It's a bot. Period.

dragonwriter

a month ago

You might notice I wasn't responding to your specific claim about a particular comment but to a later post by a different poster commenting on a wider phenomenon. Perhaps stop trying so hard to insert the idea you want to argue against into posts where it doesn't actually exist just so you can have something to argue about. (Especially given there are many direct responses to your post actually arguing with your claim that you could instead argue with.)

shpongled

a month ago

Unclear why you think this is ChatGPT, doesn't read like it at all to me. Many people - myself included - use punctuation to emphasize and clarify.

NitpickLawyer

a month ago

The tells are in the cadence. And the not x but y. And the last line that basically says nothing, while using big words. It's like "In conclusion", but worded differently. Enough tells for me to click on their history. They have the exact same cadence on every comment. It's a bit more sophisticated than "chatgpt write a reply", but it's still 100% aigen. Check it out, you'll see it after a few messages in their history.

dfajgljsldkjag

a month ago

That comment has tons of AI tells, not simply a few punctuation.

shpongled

a month ago

No, it doesn't. The "I'm an expert at AI detection" crowd likes to cite things like "It's not X, it's Y" and other expression patterns without stopping to think that perhaps LLMs regurgitate those patterns because they are frequently used in written speech.

I assign a <5% probability that GP comment was AI written. It's easy to tell, because AI writing has no soul.

NitpickLawyer

a month ago

The message is 100% AI written. And if you click on their username and check their comment history you'll see that ALL their comments are "identical". Just do it, you'll see it by the 5th message. No one talks like that. No one talks like that on every message.

ewoodrich

a month ago

Exactly, if a comment just feels a little off but you're unsure, do a quick scan of the profile, takes 15-30 seconds at most to get sufficient signal.

If it's actually AI, the pattern becomes extremely obvious reading them back-to-back. If no clear pattern, I'll happily give them the benefit of the doubt at that point. I don't particularly care if someone occasionally cleans up a post with an LLM as long as there is a real person driving it and it's not overused.

The other day on Reddit I saw a post in r/sysadmin that absolutely screamed karma farming AI and it was really depressing seeing a bunch of people defending them as the victim of an anti-AI mob without noticing the entire profile was variations of generic "Does anyone else dislike [Tool X], am I alone? [generic filler] What does everyone else think?" posts.

shpongled

a month ago

Looking at their profile I'm inclined to agree. But I think in isolation, this one post isn't setting off enough red flags for me. At the very least, they aren't just using default prompts.

sponnath

a month ago

I think at this point it's not easy to accurately detect whether or not something is AI written. A real person can definitely write like this. In fact, that's probably where the LLMs got their writing style from.

prodigycorp

a month ago

It doesn’t read like ChatGPT at all. It is well written, hardly a crime for a comment section.

handfuloflight

a month ago

Right. It's Claude.

igravious

a month ago

GP defo did not tripper my AI slop detector :/

stevenslade

a month ago

I’d been wondering how conversation history actually works in these agent loops — the LLM itself has no memory, so whatever “history” exists is just text you keep feeding back in.

At a high level it seems to usually be one (or a mix) of:

- full transcript appended every turn

- sliding window of the last N turns / tokens

- older turns summarized into a rolling memory

- structured state (goals, decisions, progress) rendered into the prompt

- external storage + retrieval (RAG-style) to pull in only relevant past info

Under the hood I’m sure it gets more complex, but the core idea is pretty simple once you strip away the mystique: memory = prompt assembly.

bilater

a month ago

I'm curious how tools like Claude Code or Cursor edit code. Do they regenerate the full file and diff it, or do they just output a diff and apply that directly? The latter feels more efficient, but harder to implement.

cmiles8

a month ago

A lot of software is like this. You can build a bare bones but functional version for 1x investment or something that addresses every bell and whistle (often with market research saying it’s really needed) for 1000x. The 1000x version is better, but not remotely 1000x better.

A lot of SaaS has turned into this too. Take a bloated monstrosity like Salesforce and I bet 95% of customers would be very happy with a “bare bones” version that costs 1 10th the price.

loeg

a month ago

Would be nice to have a different wrapper around Claude, even something bare bones, as long as it's easy to modify. Claude code terminal has the jankiest text editor I've ever used -- holding down backspace just moves the cursor. It's something about the key repeat rate. If you manually press backspace repeatedly, it's fine, but a rapid repeat rate confuses the editor. (I'm pretty sure neither GNU readline nor BSD editline do this.) It makes editing prompts a giant pain in the ass.

thierrydamiba

a month ago

What editor are you using?

loeg

a month ago

Whatever interface you get running the claude cli.

thierrydamiba

a month ago

Try ghostty, wterm, or kitty as the terminal you run Claude code from. Much better experience.

emsign

a month ago

> The LLM never actually touches your filesystem.

But that's not correct. You give them write access to files it then compiles and executes. It could include code that then runs with the rights of the executing user to manipulate the system. It already has one foot past the door. And you'd have to set up all kinds of safeguards to make sure it doesn't walk outside completely.

It's a fundamental problem if you give agentic AI rights on your system. Which in contrast kind of is the whole purpose of agentic AI.

duncancarroll

a month ago

> "But here’s the thing"

This phrase feels like the new em dash...

lucideer

a month ago

This is a really great post, concise & clear & educational. I do find the title slightly ironic though when the code example goes on to immediately do "import anthropic" right up top.

(it's just a http library wrapping anthropic's rest API; reimplementing it - including auth - would add enough boilerplate to the examples to make this post less useful, but I just found it funny alongside the title choice)

wrochow

a month ago

I always laugh when I see a post that claims some tech is "easy". Sure, if all you're doing is creating a hello world script. But try to do something for which a client would pay you more than 25¢ for. Go ahead, try it. We'll wait (!). What about context length, or code validation, architecture, planning, large files...

Oh, yes it's easy. That's just so cute.

rcarmo

a month ago

I think mine have a little more code, but they also have a lot more tools:

- https://github.com/rcarmo/bun-steward

- https://github.com/rcarmo/python-steward (created with the first one)

And they're self-replicating!

domlebo70

a month ago

I don't code in Python much. Are those type annotations really how people are using them, or is it just for the example?

    def list_files_tool(path: str) -> Dict[str, Any]:

And it returns

    {
        "path": str(full_path),
        "files": all_files
    }

Is that useful?

utopiah

a month ago

Why limit it to few tools from a tool registry when running in a full sandbox using QEMU or thinner like Podman/Docker literally takes 10 lines of code? You can still use your real files with a mount point to a directory.

To be clear I'm not implying any of that is useful but if you do want to go down that path then why not actually do it?

cadamsdotcom

a month ago

The devil is in the details, so actually, the Emperor in this analogy most definitely does have clothes.

For example, post-training / finetuning the model specifically to use the tools it’ll be given in the harness. Or endlessly tweaking the system prompt to fine-tune the model’s behavior to a polish.

Plus - both OpenAI and Qwen have models specifically intended for coding.

pbw

a month ago

I have mixed feelings about the "Do X in N lines of code" genre. I applaud people taking the time to boil something down to its very essence, and implement just that, but I feel like the tone is always, "and the full thing is lame because it's so big," which seems off to me.

utopiah

a month ago

I do prototyping for a living and ... I definitely do "X in 1/100th lines of code" regularly.

It's exciting, liberating... but it's a lie. What I do is to get the CORE of the idea so that I fully understand it. It's really nice because I get a LOT of millage very quickly... but it's also brittle, very brittle.

My experience is that most projects are 100x bigger than the idea they embody because the "real World" is damn messy. There are always radically more edge cases than the main idea enables. At some point you have to draw a line but the furthest away you draw the line, the more code you need to do it.

So... you are right to have mixed feeling, the tiny version is only valuable to get the point but it's not something one can actually use in production.

ozim

a month ago

Magic is not agent, magic is neural network that was trained.

Yeah I agree there is bunch of BS tools on top that basically try to coerce people into paying and using their setup so they become dependent on that provider that provides some value but still they are so pushy that it is quite annoying.

egeozcan

a month ago

claude opus 4.5 is much more impressive when used in claude code. I also tried it through antigravity but from a users perspective, claude code is magic.

akhil08agrawal

a month ago

Nice breakdown. Been thinking about this a lot lately - if the core is this simple, what actually becomes the hard part?

Feels like we're headed toward a world where everyone can build these loops easily. Curious what you think separates good uses of these agents from mediocre ones.

kirjavascript

a month ago

here's my take, in 70 lines of code: https://github.com/kirjavascript/nanoagent/blob/master/nanoa...

fragmede

a month ago

I mean, if you take out the guard rails, here's codex in 46 lines of bash:

    #!/usr/bin/env bash
    set -euo pipefail
    
    # Fail fast if OPENAI_API_KEY is unset or empty
    : "${OPENAI_API_KEY:?set OPENAI_API_KEY}"
    MODEL="${MODEL:-gpt-5.2-chat-latest}"
    
    extract_text_joined() {
      # Collect all text fields from the Responses API output and join them
      jq -r '[.output[]?.content[]? | select(has("text")) | .text] | join("")'
    }
    
    apply_writes() {
      local plan="$1"
      echo "$plan" | jq -c '.files[]' | while read -r f; do
        local path content
        path="$(echo "$f" | jq -r '.path')"
        content="$(echo "$f" | jq -r '.content')"
        mkdir -p "$(dirname "$path")"
        printf "%s" "$content" > "$path"
        echo "wrote $path"
      done
    }
    while true; do
      printf "> "
      read -r USER_INPUT || exit 0
      [[ -z "$USER_INPUT" ]] && continue
      # File list relative to cwd
      TREE="$(find . -type f -maxdepth 6 -print | sed 's|^\./||')"
      USER_JSON="$(jq -n --arg task "$USER_INPUT" --arg tree "$TREE" \
        '{task:$task, workspace_tree:$tree,
          rules:[
            "Return ONLY JSON matching the schema.",
            "Write files wholesale: full final content for each file.",
            "If no file changes are needed, return files:[]"
          ] }')"
      RESP="$(
        curl -s https://api.openai.com/v1/responses \
          -H "Authorization: Bearer $OPENAI_API_KEY" \
          -H "Content-Type: application/json" \
          -d "$(jq -n --arg model "$MODEL" --argjson user "$USER_JSON" '{model:$model,input:[{role:"system",content:"You output only JSON file-write plans."},{role:"user",content:$user}],text:{format:{type:"json_schema",name:"file_writes",schema:{type:"object",additionalProperties:false,properties:{files:{type:"array",items:{type:"object",additionalProperties:false,properties:{path:{type:"string"},content:{type:"string"}},required:["path","content"]}}},required:["files"]}}}')"
      )"
      PLAN="$(printf "%s" "$RESP" | extract_text_joined)"
      apply_writes "$PLAN"
    done

dave1010uk

a month ago

Impressive!

Here's an agent in 24 lines of PHP, written in 2023. But it relies on `llm` to do HTTP and JSON.

https://github.com/dave1010/hubcap

bochoh

a month ago

I’ve had decent results with this for context management in large code bases so far https://github.com/GMaN1911/claude-cognitive

oars

a month ago

Can't believe Claude Code was launched less than a year ago. It's certainly a new tool in my kit and allowed me to approach and work on new coding problems in a new way.

oli5679

a month ago

https://github.com/mistralai/mistral-vibe

This is a really nice open source coding agent implementation. The use of async is interesting.

voidhorse

a month ago

Wait, did people in the tech world not know this? If this is some shocking revelation to engineers we are in trouble.

Any and every "AI" experience is just kiddie level program mg wrapping LLMs.

kristopolous

a month ago

The source code link at the bottom of the article goes to YouTube for some reason

user

a month ago

[deleted]

mephos

a month ago

How much Claude could a Claude code code if a Claude code could code Claude

santiagobasulto

a month ago

I'm surprised this post has so many upvotes. This is a gross oversimplification of what Claude Code (and other agents can do). On top of that, it's very poorly engineered.

all2

a month ago

> Poorly engineered

How so? As a pedagogic tool it seems adequate. How would you change this to be better (either from a teaching standpoint or from a SWE standpoint)?

OsrsNeedsf2P

a month ago

Yea.. our startup greatly overestimated how hard it is to make a good agent loop. Handling exit conditions, command timeouts, context management, UI, etc is surprisingly hard to do seamlessly.

nvader

a month ago

I'm curious which startup, if you wouldn't mind sharing?

For reciprocity, I work at Imbue, and we can also attest to the real work complexities of this domain.

OsrsNeedsf2P

a month ago

Called Ziva[0], we do AI agents for game development. If you want to jump on a call and discuss strategies, my email is in my bio

https://ziva.sh/

bjt12345

a month ago

Do you mean they underestimated how hard it is?

OsrsNeedsf2P

a month ago

No, overestimated. You can make a terrible CC in 200 LoC, but after using it for 3 minutes you'll realize how much more goes into it

lemontheme

a month ago

Gotta admit, the phrasing tripped me up as well. You underestimated the effort that ultimately went into it

nxobject

a month ago

I'll admit that I'm tickled pink by the idea of a coding agent recreating itself. Are we at a point where agents can significantly and autonomously improve themselves?

__0x01

a month ago

These LLM tools appear to have an unprecedented amount of access to the file systems of their users.

Is this correct, and if so do we need to be concerned about user privacy and security?

fragmede

a month ago

We should be absolutely terrified about the amount of access these things have to users systems. Of course there is advice to use a sandbox but there are stupid people out there (I'm one of them) who disregard this advice because it's too cumbersome, so Claude is being run in yolo mode, on the same machine that has access access to bank accounts, insurance, password manager and crypto private keys.

user

a month ago

[deleted]

andai

a month ago

    from ddgs import DDGS

    def web_search(query: str, max_results: int = 8) -> list[dict]:
        return DDGS().text(query, max_results=max_results)

erelong

a month ago

Kind of a meta thought but I guess we could just ask a LLM to guide us through creating some of these things ourselves, right? (Tools, agents, etc.)

mudkipdev

a month ago

Is this not using constrained decoding? You should pass the tool schemas to the "tools" parameter to make sure all tool calls are valid

fb03

a month ago

"I'm using OpenAI here"

proceeds to show a piece of code importing anthropic

was pretty confusing to me

_def

a month ago

Are there useful open source "agents" already ready to use with local LLMs?

Waterluvian

a month ago

How much code could Claude Code code if Claude Code could code Claude?

dangoodmanUT

a month ago

This definitely will handle large files or binary files very poorly

computerex

a month ago

I think some of the commenters are missing the point of this article. Claude Code is a low level harness around the model. Low level thin wrappers are unreasonably effective at code editing. My takeaway is that imagine how good code editing systems will be once our tools are not merely wrappers around the model. Once we have tall vertical systems that use engineering+llms to solve big chunks of problems. I could imagine certain classes of software being "solved".

Imagine a SDK that's dedicated to customizing tools like claude code/cursor cli to produce a class of software like b2b enterprise saas. Within the bounds of the domain(s) modeled these vertical systems would ultimately even crush the capabilities of thin low level wrappers we have today.

erichocean

a month ago

The tip of the sphere in agentic code harnesses today is to RL train them as dedicated conductor/orchestrator models.

Not 200 lines of Python.

aszen

a month ago

Can you elaborate on this?

erichocean

a month ago

Here you go: https://research.nvidia.com/labs/lpr/ToolOrchestra/

Big models (like Claude Opus 4.5) can (and do) just RL-train this into the main model.

8note

a month ago

as a comparison, the gemini cli agent with gemini 2 half the time writes its own tool call parameters incorrectly. it didnt quite know when to make a tool call, which tool result was the most recent(it always assumed the first one was the one to use, rather than the last one, when multiple reads of the same file were in context) etc.

gemini 3 has pretty clearly been trained for this workflow of text output, since it can actually get the right calls in the first shot most of the time, and pays attention to the end of the context and not just the start.

gemini 3 is sitting within a format of text that it has been trained to be in, where for gemini 2, it only had the prompt to tell it how to work within the tool

d4rkp4ttern

a month ago

This is consistent with how I've defined the 3 core elements of an agent:

- Intelligence (the LLM)

- Autonomy (loop)

- Tools to have "external" effects

Wrinkles that I haven't seen discussed much are:

(1) Tool-forgetting: LLM forgets to call a tool (and instead outputs plain text). Some may say that these concerns will disappear as frontier models improve, there will always be a need for having your agent scaffolding work well with weaker LLMs (cost, privacy, etc), and as long as the model is stochastic there will always be a chance of tool-forgetting.

(2) Task-completion-signaling: Determining when a task is finished. This has 2 sub-cases: (2a) we want the LLM to decide that, e.g. search with different queries until desired info found, (2b) we want to specify deterministic task completion conditions, e.g., end the task immediately after structured info extraction, or after acting on such info, or after the LLM sees the result of that action etc.

After repeatedly running into these types of issues in production agent systems, we’ve added mechanisms for these in the Langroid[1] agent framework, which has blackboard-like loop architecture that makes it easy to incorporate these.

For issue (1) we can configure an agent with a `handle_llm_no_tool` [2] set to a “nudge” that is sent back to the LLM when a non-tool response is detected (it could also be set as a lambda function to take other possible actions). As others have said, grammar-based constrained decoding is an alternative but only works for LLM-APIs that support.

For issue (2a) Langroid has a DSL[3] for specifying task termination conditions. It lets you specify patterns that trigger task termination, e.g.

- "T" to terminate immediately after a tool-call,

- "T[X]" to terminate after calling the specific tool X,

- "T,A" to terminate after a tool call, and agent handling (i.e. tool exec)

- "T,A,L" to terminate after tool call, agent handling, and LLM response to that

For (2b), in Langroid we rely on tool-calling again, i.e. the LLM must emit a specific DoneTool to signal completion. In general we find it useful to have orchestration tools for unambiguous control flow and message flow decisions by the LLM [4].

[1] Langroid https://github.com/langroid/langroid

[2] Handling non-tool LLM responses https://langroid.github.io/langroid/notes/handle-llm-no-tool...

[3] Task Termination in Langroid https://langroid.github.io/langroid/notes/task-termination/

[4] Orchestration Tools: https://langroid.github.io/langroid/reference/agent/tools/or...

user

a month ago

[deleted]

hooverd

a month ago

Woah now, did you get Anthropic's permission to use Claude outside of Claude Code?

wizzard0

a month ago

wdym? claude code is just a wrapper

you can get yourself an api key at console.anthropic.com and build whatever you want

(i use local models where possible but must admit opus45 is good)

Python is quite garbage for this sort of thing.

tomhow

a month ago

Sure, and the opposite of shallow dismissal isn't shallow praise, it's substantive explanation of the reasoning behind one's assessment.