antirez
20 days ago
Something that may be interesting for the reader of this thread: this project was possible only once I started to tell Opus that it needed to keep a file with all the implementation notes, accumulating everything we discovered during the development process. The file also carried clear instructions that it must be kept updated, and re-read ASAP after context compaction. This kinda enabled Opus to do such a big coding task in a reasonable amount of time without losing track. Check the file IMPLEMENTATION_NOTES.md in the GitHub repo for more info.
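Roughly, the trick is that the notes file itself carries standing instructions back to the model. A minimal sketch of the kind of preamble I mean (hypothetical wording; the real file in the repo is more detailed):

```
# IMPLEMENTATION_NOTES.md

Instructions for the model (always in effect):
1. Re-read this entire file immediately after any context compaction,
   before doing anything else.
2. Append every non-obvious discovery (bugs, quirks of the reference
   implementation, perf findings) to the Work Log below.
3. Keep this file updated as part of every change you make.

## Work Log
- ...
```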
soulofmischief
20 days ago
It's funny watching people rediscover well-established paradigms. Suddenly everyone's recreating software design documents [0].
People can say what they want about LLMs reducing intelligence/ability; the trend has clearly been that people are beginning to get more organized, document things better, enforce constraints, and think in higher-level patterns. And there's renewed interest in formal verification.
LLMs will force the skilled, employable engineer to chase both maintainability and productivity from the start, in order to maintain a competitive edge with these tools. At least until robots replace us completely.
[0] https://www.atlassian.com/work-management/knowledge-sharing/...
falloutx
20 days ago
The thing is that currently most of these projects are just done by engineers. It's easy to stay organized when the project lasts a couple of weeks and stays within <5 engineers. The issues start when the software starts living longer and you add in modern agile practices: it becomes a complete mess, with each PM trying to add random features on top of the existing code. As you add more and more code, maintainability will just become impossible.
adw
18 days ago
> The issues start when the software starts living longer
There's going to be a bifurcation; caricaturing it, "operating system kernels" and "disposable code". In the latter case, you don't maintain it; you dispose of it and vibe-code up a new one.
soulofmischief
19 days ago
I am aware that software complexity scales. That is literally why I suggested that having good standards from the start is becoming increasingly important.
vessenes
20 days ago
Salvatore - this is cool. I am a fan of using Steve Yegge's beads for this - it generally cuts the markdown file cruft significantly.
Did you run any benchmarking? I'm curious if Python's stack is faster or slower than a pure C vibe-coded inference tool.
samtheprogram
20 days ago
There are benchmarks in the README: Python is ~10x faster; it's heavily optimized. Based on the numbers and my experience with Flux.1, I'm guessing the Python run is JIT'd (or Flux.2 is faster), although without the JIT it'd likely only be ~half as fast (i.e. definitely not 10x slower).
antirez
19 days ago
There are a lot of shortcomings in the current implementation, making it slow (but in my tree it is 2x faster as we speak). For instance, activations aren't kept on the GPU, kernels are not fused, flash attention is not used, and there are many other issues. Now I'll focus on those changes to approach PyTorch numbers a little bit more.
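For the curious, kernel fusion is the easiest of these to show in miniature. A toy sketch (mine, not code from the repo): instead of one pass over the activations per operation, a fused kernel does several operations per pass, and since this kind of inference is memory-bandwidth bound, fewer passes means less time moving data.

```c
#include <math.h>
#include <stddef.h>

/* Unfused: two full passes over the activations; every element is
   loaded and stored twice. The memory traffic, not the math, costs. */
static void bias_add(float *x, const float *bias, size_t n) {
    for (size_t i = 0; i < n; i++)
        x[i] += bias[i];
}

static void gelu(float *x, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float v = x[i];
        /* tanh approximation of GELU; 0.7978845608f = sqrt(2/pi) */
        x[i] = 0.5f * v * (1.0f + tanhf(0.7978845608f *
                                        (v + 0.044715f * v * v * v)));
    }
}

/* Fused: one pass; each element is loaded and stored exactly once. */
static void bias_gelu_fused(float *x, const float *bias, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float v = x[i] + bias[i];
        x[i] = 0.5f * v * (1.0f + tanhf(0.7978845608f *
                                        (v + 0.044715f * v * v * v)));
    }
}
```

On a GPU the same logic holds, with the extra cost that every unfused step is also a separate kernel launch and a round trip through device memory.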
lukebechtel
20 days ago
Very cool!
Yep, a constantly updated spec is the key. Wrote about this here:
https://lukebechtel.com/blog/vibe-speccing
I've also found it's helpful to have it keep an "experiment log" at the bottom of the original spec, or in another document, which it must update whenever things take "a surprising turn".
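For instance, an entry might look like this (format and details invented for illustration):

```
## Experiment Log
- 2025-XX-XX: int8 quantization of attention weights caused visible
  banding in output images. Surprising: the VAE decode, not the
  transformer, was the sensitive step. Reverted; see "Precision".
```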
ctoth
20 days ago
Honest question: what do you do when your spec has grown to over a megabyte?
Some things I've been doing:
- Move as much actual data into YAML as possible.
- Use CEL?
- Ask Claude to rewrite pseudocode in specs into RFC-style constrained language?
How do you sync your spec and code in both directions? I have some slash commands that do this, but I'm not thrilled with them.
I tend to have to use Gemini for actually juggling the whole spec. Of course it's nice and chunked as much as it can be, but still. There's gonna need to be a whole new way of doing this.
If programming languages can have spooky action at a distance, wait until we get into "but paragraph 7, subsection 5 of section G clearly defines asshole as..."
What does a structured language look like when it doesn't need mechanical sympathy? YAML + CEL is really powerful and underexplored but it's still just ... not what I'm actually wanting.
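For anyone who hasn't tried the combo, here's a tiny sketch of what I mean (schema invented for illustration, and the harness that evaluates the constraints is assumed; CEL is just the expression language):

```yaml
# spec-data.yaml -- the "actual data" part of the spec (hypothetical)
endpoints:
  - name: create_user
    auth_required: true
    rate_limit_per_min: 60
  - name: health
    auth_required: false
    rate_limit_per_min: 600

# Invariants as CEL expressions over the data above:
constraints:
  - "endpoints.all(e, e.rate_limit_per_min > 0)"
  - "endpoints.exists(e, e.name == 'health' && !e.auth_required)"
```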
lukebechtel
20 days ago
Sharding or compaction, both possible with LLMs.
Sharding: Make well-named sub-documents for parts of the work. The LLM will be happy to create these and maintain cross-references for you (see the sketch below).
Compaction: Ask the LLM to compact parts of the spec, or changelog, which are over specified or redundant.
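Concretely, a sharded spec might end up looking like this (layout invented for illustration):

```
SPEC.md              <- short index: goals, global invariants, links out
specs/
  tokenizer.md       <- each shard links back, e.g. "see SPEC.md#invariants"
  sampler.md
  vae-decoder.md
CHANGELOG.md         <- periodically compacted by the LLM itself
```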
ctoth
20 days ago
My question was something like: what is the right representation for program semantics when the consumer is an LLM and the artifact exceeds context limits?
"Make sub-documents with cross-references" is just... recreating the problem of programming languages but worse. Now we have implicit dependencies between prose documents with no tooling to track them, no way to know if a change in document A invalidates assumptions in document B, no refactoring support, no tests for the spec.
To make things specific:
lukebechtel
20 days ago
Ah, I see your point more clearly now.
At some level you have to do semantic compression... To your point on non-explicitness -- the dependencies between the specs and sub-specs can be explicit (e.g. file:// links, etc.).
But your overall point on assumption invalidation remains... Reminds me of a startup some time ago that was doing "Automated UX Testing": user personas (e.g. prosumer, average Joe) were created, and goals / implicit UX flows through the UI were described (e.g. "I want to see my dashboard"). Then an LLM could pretend to be each persona and test each day whether that user type could achieve the goals behind their user flow.
This doesn't fully solve your problem, but it hints at a solution perhaps.
Some of what you're looking for is found by adding strict linters / tests. But your repo looks like something in an entirely different paradigm and I'm curious to dig into it more.
vidarh
19 days ago
Telling it to maintain a list of areas that need work, with references to the specs for those specific areas, has worked well for me.
anonzzzies
19 days ago
We found, especially with Opus and recent Claude Code, that it is better / more precise at reading existing code to figure out the current status than at reading specs. It seems (for us) it is less precise at "comprehending" the spec's English than the code, and that sometimes shows up as wrong assumptions for new tasks, which results in incorrect implementations. So we dropped this. Because of caching, it doesn't seem too bad on tokens either.
nonethewiser
19 days ago
Specs with agents seem destined for drift. It'll randomly change something you don't know about, and it will go too fast for you to really keep the spec updated. I went from using Claude Code totally naively, to using little project-management frameworks, to now just using it by itself again. I'm getting the best results like this, and usually start in planning mode (unless the issue is quite small/clear).
My experience has been that it gets worse with more structure. You misinform it and heavily bias its results in ways you don't intend. Maybe there are AI wizards out there with the perfect system of markdown artifacts, but I found it increased the trouble a lot and made the results worse. It's a non-deterministic system. Knock yourself out trying to micromanage it.
celadin
20 days ago
I'm still sharing this post in the internal org trainings I run for those new to LLMs. Thanks for it - really great overview of the concept!
I saw in your other comment that you've made accommodations for the newer generation, and I will confess that in Cursor (with plan mode) I've found an abbreviated form works just as well as the extremely explicit example found in the post.
If you ever write a follow-up, I imagine it'd be just as well received!
daliusd
20 days ago
Looks like default OpenCode / Claude Code behavior with Claude models. Why the extra prompt?
lukebechtel
20 days ago
Good question!
1. The post was written before this was common :)
2. If using Cursor (as I usually am), this isn't what it always does by default, though you can invoke something like it using "plan" mode. Its default is to keep todo items in a nice little todo list, but that isn't the same thing as a spec.
3. I've found that Claude Code doesn't always do this, for reasons unknown to me.
4. The prompt is completely fungible! It's really just an example of the idea.
AINoob2026
20 days ago
This is amazing. Is there any way you could share the log of prompts you used, and anything else besides the implementation notes, that led to such a result? Would love to learn from your experience and steps. Thank you
bloudermilk
20 days ago
Do you plan on writing about the other lessons you learned, which you mentioned in the README? As a big fan of your software and writing for many years, I would deeply appreciate your perspective using these tools!
echelon
20 days ago
> No Python runtime, no PyTorch, no CUDA toolkit required at inference time.
This is amazing, Salvatore! Please spend some more time here and free us from the CUDA toolkit and Python.
terhechte
20 days ago
There are multiple task-tracking solutions for Claude and other LLMs that let it define tasks, add implementation notes, and (crucially) add sub-tasks and dependencies. I'm using Beads (https://github.com/steveyegge/beads) and I think it really improves the outcome, especially for larger projects.
thundergolfer
20 days ago
Was the LLM using vision capabilities to verify the correctness of its work? If so, how was that verification method guided by you?
antirez
20 days ago
Yes, Opus could check the image to see if it matched the prompt, but I advised the model to stop and ask the human for a better check, and for a description of what the cause of a corrupted image could be. But the fact that it could catch obvious regressions was good.
krschacht
14 days ago
antirez: how do you reliably get Claude to re-read the file after compaction? It's easy to let Claude run for a while; it compacts, starts getting much worse after compaction, and I don't always catch the moment of compaction in time to tell it to re-read the notes file.
tucnak
20 days ago
This development work-cycle pattern lends itself nicely to Antigravity, which does maybe 80% of this out of the box, and can be nudged to do the rest with a little bit of prompting.
vient
19 days ago
Peculiar that in IMPLEMENTATION_NOTES.md Claude thinks it is 2024 and not 2026 (see the Work Log).
dostick
20 days ago
So Codex would do that task with a regular spec and no recompaction?
motoboi
19 days ago
Maybe you should experiment with gpt-5.1-codex-max, which has the new compaction algorithm that gpt-5.2-codex seems to lack.