hackernews client

Flux 2 Klein pure C inference

453 pointsposted 20 days ago

150 Comments

antirez

20 days ago

Something that may be interesting for the reader of this thread: this project was possible only once I started to tell Opus that it needed to take a file with all the implementation notes, and also accumulating all the things we discovered during the development process. And also, the file had clear instructions to be taken updated, and to be processed ASAP after context compaction. This kinda enabled Opus to do such a big coding task in a reasonable amount of time without loosing track. Check the file IMPLEMENTATION_NOTES.md in the GitHub repo for more info.

soulofmischief

20 days ago

It's funny watching people rediscover well-established paradigms. Suddenly everyone's recreating software design documents [0].

People can say what they want about LLMs reducing intelligence/ability; The trend has clearly been that people are beginning to get more organized, document things better, enforce constraints, and think in higher-level patterns. And there's renewed interest in formal verification.

LLMs will force the skilled, employable engineer to chase both maintainability and productivity from the start, in order to maintain a competitive edge with these tools. At least until robots replace us completely.

[0] https://www.atlassian.com/work-management/knowledge-sharing/...

falloutx

20 days ago

The thing is that currently most of these projects are just done by engineers, Its easy to stay organized when the project lasts couple of weeks and stays within <5 engineers. The issues starts when the software starts living longer and you add in the modern agile practices, it comes a complete mess which each PM trying to add random features on top of the existing code. As you increase more and more code, the maintainability will just become impossible.

adw

19 days ago

> The issues starts when the software starts living longer

There's going to be a bifurcation; caricaturing it, "operating system kernels" and "disposable code". In the latter case, you don't maintain it; you dispose of it and vibe-code up a new one.

soulofmischief

20 days ago

I am aware that software complexity scales. That is literally why I suggested that having good standards from the start is becoming increasingly important.

vessenes

20 days ago

Salvatore - this is cool. I am a fan of using Steve Yegge's beads for this - it generally cuts the markdown file cruft significantly.

Did you run any benchmarking? I'm curious if python's stack is faster or slower than a pure C vibe coded inference tool.

samtheprogram

20 days ago

There’s benchmarks in the README. Python is ~10x faster. It’s heavily optimized. Based on the numbers and my experience with Flux.1, I’m guessing the Python run is JIT’d (or Flux.2 is faster), although it’d likely only be ~half as fast if it weren’t (i.e. definitely not 10x slower).

antirez

20 days ago

There are a lot of shortcomings in the current implementation, making it slow (but in my tree is 2x faster as we speak). For instance activations aren't taken in the GPU, kernels are not fused, flash attention is not used, and many other issues. Now I'll focus on that changes to approach PyTorch numbers a little bit more.

lukebechtel

20 days ago

Very cool!

Yep, a constantly updated spec is the key. Wrote about this here:

https://lukebechtel.com/blog/vibe-speccing

I've also found it's helpful to have it keep an "experiment log" at the bottom of the original spec, or in another document, which it must update whenever things take "a surprising turn"

ctoth

20 days ago

Honest question: what do you do when your spec has grown to over a megabyte?

Some things I've been doing:

- Move as much actual data into YML as possible.

- Use CEL?

- Ask Claude to rewrite pseudocode in specs into RFC-style constrained language?

How do you sync your spec and code both directions? I have some slash commands that do this but I'm not thrilled with them?

I tend to have to use Gemini for actually juggling the whole spec. Of course it's nice and chunked as much as it can be? but still. There's gonna need to be a whole new way of doing this.

If programming languages can have spooky language at a distance wait until we get into "but paragraph 7, subsection 5 of section G clearly defines asshole as..."

What does a structured language look like when it doesn't need mechanical sympathy? YML + CEL is really powerful and underexplored but it's still just ... not what I'm actually wanting.

lukebechtel

20 days ago

Sharding or compaction, both possible with LLMs.

Sharding: Make well-named sub-documents for parts of work. LLM will be happy to create these and maintain cross references for you.

Compaction: Ask the LLM to compact parts of the spec, or changelog, which are over specified or redundant.

ctoth

20 days ago

My question was something like: what is the right representation for program semantics when the consumer is an LLM and the artifact exceeds context limits?

"Make sub-documents with cross-references" is just... recreating the problem of programming languages but worse. Now we have implicit dependencies between prose documents with no tooling to track them, no way to know if a change in document A invalidates assumptions in document B, no refactoring support, no tests for the spec.

To make things specific:

https://github.com/ctoth/polyarray-spec

lukebechtel

20 days ago

Ah, I see your point more clearly now.

At some level you have to do semantic compression... To your point on non-explicitness -- the dependencies between the specs and sub-specs can be explicit (i.e. file:// links, etc).

But your overall point on assumption invalidation remains... Reminds me of a startup some time ago that was doing "Automated UX Testing" where user personas (i.e. prosumer, avg joe, etc) were created, and Goals/ Implicit UX flows through the UI were described (i.e. "I want to see my dashboard", etc). Then, an LLM could pretend to be each persona, and test each day whether that user type could achieve the goals behind their user flow.

This doesn't fully solve your problem, but it hints at a solution perhaps.

Some of what you're looking for is found by adding strict linter / tests. But your repo looks like something in an entirely different paradigm and I'm curious to dig into it more.

vidarh

20 days ago

Telling it to maintain a list of areas that needs work, with references to specs for those specific areas has worked well for me.

anonzzzies

20 days ago

We found, especially with Opus and recent claude code that it is better/more precise at reading existing code for figuring out what the current status is than reading specs. It seems (for us) it is less precise at 'comprehending' the spec English than it is the code and that will sometimes reflect in wrong assumptions for new tasks which will result in incorrect implementations of those tasks. So we dropped this. Because of caching, it doesn't seem too bad on the tokens either.

nonethewiser

20 days ago

Specs with agents seem destined for drift. It'll randomly change something you dont know about and it will go too fast for you to really keep it updated. I went from using claude code totally naively to using little project management frameworks to now just using it by itself again. Im gettin the best results like this, and usually start in planning mode (unless the issue is quite small/clear).

My experience has been that it gets worse with more structure. You misinform it and heavily bias it's results in ways you dont intend. Maybe there are AI wizards out there with the perfect system of markdown artifacts but I found it increased the trouble a lot and made the results worse. It's a non deterministic system. Knock yourself out tryin to micromanage it.

celadin

20 days ago

I'm still sharing this post in the internal org trainings I run for those new to LLMs. Thanks for it - really great overview of the concept!

I saw in your other comment you've made accommodations for the newer generation, and I will confess than in Cursor (with plan mode) I've found an abbreviated form works just as well as the extremely explicit example found in the post.

If you ever had a followup, I imagine it'd be just as well received!

daliusd

20 days ago

Looks like default OpenCode / Claude Code behavior with Claude models. Why the extra prompt ?

lukebechtel

20 days ago

Good question!

1. The post was written before this was common :)

2. If using Cursor (as I usually am), this isn't what it always does by default, though you can invoke something like it using "plan" mode. It's default is to keep todo items in a little nice todo list, but that isn't the same thing as a spec.

3. I've found that Claude Code doesn't always do this, for reasons unknown to me.

4. The prompt is completely fungible! It's really just an example of the idea.

AINoob2026

20 days ago

This is amazing. Is there any way you could share the log of prompts you used and other things aside from the implementation notes to reach such a result? Would love to learn from your experience and steps. Thank you

bloudermilk

20 days ago

Do you plan on writing about the other lessons you learned, which you mentioned in the README? As a big fan of your software and writing for many years, I would deeply appreciate your perspective using these tools!

echelon

20 days ago

> No Python runtime, no PyTorch, no CUDA toolkit required at inference time.

This is amazing, Salvatore! Please spend some more time here and free us from the CUDA toolkit and Python.

terhechte

20 days ago

There're multiple task solutions for Claude or other llms that let it define tasks, add implementation notes and (crucially) add sub-tasks and dependencies. I'm using Beads (https://github.com/steveyegge/beads) and I think it really improves the outcome; especially for larger projects.

thundergolfer

20 days ago

Was the LLM using vision capabilities to verify the correctness of it's work? If so, how was that verification method guided by you?

antirez

20 days ago

Yes, Opus could check the image to see if it matched the prompt, but I adviced the model to stop and ask the human for a better check and a description of what the cause of the corrupted image could be. But the fact it could catch obvious regressions was good.

krschacht

14 days ago

antirez — how do you reliably get Claude to re-read the file after compaction? It's easy to let Claude run for awhile, it compacts and starts getting much worse after compaction, and I don't always catch the moment of compaction to be able to tell it to re-read the notes file.

tucnak

20 days ago

This development workcycle pattern lends nicely to Antigravity, which kind of does 80% this out the box, and can be nudged to do the rest with a little bit of prompting.

vient

19 days ago

Peculiar that in IMPLEMENTATION_NOTES.md Claude thinks it is 2024 and not 2026 (see Work Log)

dostick

20 days ago

So Codex would do that task with regular spec and no recompacting?

motoboi

19 days ago

maybe you should experiment with gpt-5.1-codex-max, which has the new compaction algorithm that gpt-5.2-codex seems to lack.

adefa

20 days ago

I ran a similar experiment last month and ported Qwen 3 Omni to llama cpp. I was able to get GGUF conversion, quantization, and all input and output modalities working in less than a week. I submitted the work as a PR to the codebase and understandably, it was rejected.

https://github.com/ggml-org/llama.cpp/pull/18404

https://huggingface.co/TrevorJS/Qwen3-Omni-30B-A3B-GGUF

antirez

20 days ago

The refusal because often AI writes suboptimal GGML kernels looks very odd, to me. It means that who usually writes manually GGML kernels, could very easily steer the model into writing excellent kernels, and even a document for the agents can be compiled with the instructions on how to do a great work. If they continue in this way, soon a llama.cpp fork will emerge that will be developed much faster and potentially even better: it is unavoidable.

rjh29

20 days ago

The refusal is probably because OP said "100% written by AI" and didn't indicate an interest in actually reviewing or maintaining the code. In fact, a later PR comment suggests that the AI's approach was needlessly complicated.

hirako2000

20 days ago

Also because it's a large PR. Also because the maintainer has better things to do than taking longer and more energy to review than the author spent to write it, just to find that multiple optimisations will be requested, which the author may not be able to take on.

the creator of llama.cc can hardly be suspected to be reluctant or biased towards GenAI.

adefa

20 days ago

Absolutely -- it's perfectly understandable. I wanted to be completely upfront about AI usage and while I was willing and did start to break the PR down into parts, it's totally OK for the maintainers to reject that too.

I wanted to see if Claude Code could port the HF / MLX implementation to llama.cpp and it was successful -- in my mind that's wild!

I also learned a ton about GPU programming, how omni models work, and refined my approach to planning large projects with automated end to end integration tests.

The PR was mostly to let people know about the code and weights, since there are quite a few comments requesting support:

https://github.com/ggml-org/llama.cpp/issues/16186

hirako2000

20 days ago

Consider a fork while optimizing. Of Claude can optimize then you could prove someone wrong and get it merged.

Nice work getting multimodal in there already.

nickandbro

20 days ago

I wonder if some of the docs from https://app.wafer.ai/docs could be used to make the model be better at writing GGML kernels. Interesting use case.

nickpsecurity

20 days ago

[flagged]

hirako2000

20 days ago

Not sure we need to invoke Jesus to agree with the liability concerns.

nickpsecurity

20 days ago

[flagged]

user34283

20 days ago

Complete non-issue in my experience.

With usage on a daily basis since GPT-4 I have not once encountered a scenario where I was concerned about the output being complex enough and a verbatim copy to warrant such concerns.

Generally it would seem statistically unlikely to reconstruct a copyrighted work, rather the output should be a probabilistic average. Snippets are typically too common and short to be protected by copyright. Copyright challenges are likely to fail on the "substantial similarity" test.

I understand plaintiffs would need to show that code is virtually identical, not just similar, and that these parts represent a "substantial" portion of the original work's creative value.

user

20 days ago

[deleted]

neomantra

20 days ago

Thanks for sharing this — I appreciate your motivation in the README.

One suggestion, which I have been trying to do myself, is to include a PROMPTS.md file. Since your purpose is sharing and educating, it helps others see what approaches an experienced developer is using, even if you are just figuring it out.

One can use a Claude hook to maintain this deterministically. I instruct in AGENTS.md that they can read but not write it. It’s also been helpful for jumping between LLMs, to give them some background on what you’ve been doing.

antirez

20 days ago

In this case, instead of a prompt I wrote a specification, but later I had to steer the models for hours. So basically the prompt is the sum of all such interactions: incredibly hard to reconstruct to something meaningful.

enriquto

20 days ago

This steering is the main "source code" of the program that you wrote, isn't it? Why throw it away. It's like deleting the .c once you have obtained the .exe

minimaxir

20 days ago

It's more noise than signal because it's disorganized, and hard to glean value from it (speaking from experience).

neomantra

20 days ago

I wasn’t exactly suggesting this. The source code (including SVG or DOCX or HTMl+JS for document work) is the primary ground truth which the LLM modifies. Humans might modify it too. This ground truth is then rendered (compiled, visualized) to the end product.

The PROMPTS.md is communication metadata. Indeed, if you fed the same series of prompts freshly, the resultant ground truths might not make sense because of the stochastic nature of LLMs.

Maybe “ground truth” isn’t exactly the right word, but it is the consistent, determined basis which formed from past work and will evolve with future work.

enriquto

19 days ago

> because of the stochastic nature of LLMs.

But is this "stochastic nature" inherent to the LLM? Can't you make the outputs deterministic by specifying a version of the weights and a seed for the random number generator?

Your vibe coding log (i.e. your source code) may start like this:

    fix weights as of 18-1-2026
    set rng seed to 42

    write a program that prints hello world

Notice that the first two lines may be added automatically by the system and you don't need to write or even see them.

neomantra

19 days ago

I see what you are saying, and perhaps we are zeroing in on the importance of ground truths (even if it is not code but rather PLANs or other docs).

For what you're saying to work, then the LLM must adhere consistently to that initial prompt. Different LLMs and the same LLM on different runs might have different adherence and how does it evolve from there? Meaning at playback of prompt #33, will the ground truth gonna be the same and the next result the same as in the first attempt?

If this is local LLM and we control all the context, then we can control that LLM's seeds and thus get consistent output. So I think your idea would work well there.

I've not started keeping thinking traces, as I'm mostly interested in how humans are using this tech. But, that could get involved in this as well, helping other LLMs understand what happened with a project up to a state.

adw

19 days ago

> But is this "stochastic nature" inherent to the LLM?

At any kind of reasonable scale, yes. CUDA accelerators, like most distributed systems, are nondeterministic, even at zero temperature (which you don't want) with fixed seed.

wyldfire

20 days ago

I've only just started using it but the ralph wiggum / ralph loop plugin seems like it could be useful here.

If the spec and/or tests are sufficiently detailed maybe you can step back and let it churn until it satisfies the spec.

neomantra

20 days ago

Isn't the "steering" in the form of prompts? You note "Even if the code was generated using AI, my help in steering towards the right design, implementation choices, and correctness has been vital during the development." You are a master of this, let others see how you cook, not just taste the sauce!

I only say this as it seems one of your motivations is education. I'm also noting it for others to consider. Much appreciation either way, thanks for sharing what you did.

stellalo

20 days ago

Doesn’t Claude Code allow to just dump entire conversations, with everything that happened in them?

joemazerino

20 days ago

All sessions are located in the `~/.claude/projects/foldername` subdirectory.

ukuina

20 days ago

Doesn't it lose prompts prior to the latest compaction?

jitl

20 days ago

I’ve sent Claude back to look at the transcript file from before compaction. It was pretty bad at it but did eventually recover the prompt and solution from the jsonl file.

onedognight

20 days ago

It’s loses them in the current context (say 200k tokens), not in its SQLite history db (limited by your local storage).

neomantra

20 days ago

I did not know it was SQLite, thx for noting. That gives the idea to make an MCP server or Skill or classical script which can slurp those and make a PROMPTS.md or answer other questions via SQL. Will try that this week.

joemazerino

19 days ago

It doesn't lose the prompt but slowly drains out of context. Use the PreCompact hook to write a summary.

chr15m

20 days ago

aider keeps a log of this, which is incredibly useful.

d_watt

20 days ago

Regarding the meta experiment of using LLMs to transpile to a different language, how did you feel about the outcome / process, and would you do the same process again in the future?

I've had some moments recently for my own projects as I worked through some bottle necks where I took a whole section of a project and said "rewrite in rust" to Claude and had massive speedups with a 0 shot rewrite, most recently some video recovery programs, but I then had an output product I wouldn't feel comfortable vouching for outside of my homelab setup.

antirez

20 days ago

It depends on the situation. In this case the agent worked only using the reference code provided by Flux's Black Forest Labs which is basically just the pipeline implemented as a showcase. The fundamental way for this process to work is that the agent can have a feedback to understand if it is really making progresses, and to debug failures against a reference implementation. But then all the code was implemented with many implementation hints about what I wanted to obtain, and without any reference of other minimal inference libraries or kernels. So I believe this just is the effect of putting together known facts about how Transformers inference works plus an higher level idea of how software should appear to the final user. Btw today somebody took my HNSW implementation for vector sets and translated it to Swift (https://github.com/jkrukowski/swift-hnsw). I'm ok with that, nor I care of this result was obtained with AI or not. However it is nice that the target license is the same, given the implementation is so similar to the C one.

jhatemyjob

20 days ago

When I first saw the OP, panic started to set in that I am fucked and Chat-Completions/LLMs/AI/whatever-you wanna-call-it will soon be able to create anything and eat away at my earning potential. And I will spend my elder years living with roommates, with no wife or children because I will not be able to provide for them. But upon reading that you used a reference implementation, I've realized that you simply managed to leverage it as the universal translator apenwarr believes is the endgame for this new technology [1]. So, now I feel better. I can sleep soundly tonight knowing my livelihood is safe, because the details still matter.

[1] https://apenwarr.ca/log/20251120

tern

20 days ago

Nope, that will happen, but it also doesn't mean you're fucked. It just means it's time to move up the value stack.

The fear that lead to the black and white thinking expressed in your comment is the real issue.

rcarmo

20 days ago

This is pretty great. I’ve gone and hacked your GTE C inference project to Go purely for kicks, but this one I will look at for possible compiler optimizations and building a Mac CLI for scripting…

kubb

20 days ago

This repo has Swift wrappers, not a rewrite of hnsw.c, which apparently you weren't the only author of.

antirez

20 days ago

Thanks,I thought it was a complete rewrite of the same logic and algorithms.

rcarmo

20 days ago

I have a set of prompts that are essentially “audit the current code changes for logic errors” (plus linting and testing, including double checking test conditions) and I run them using GPT-5.x-Codex on Claude generated code.

It’s surprising how much even Opus 4.5 still trips itself up with things like off-by-one or logic boundaries, so another model (preferably with a fresh session) can be a very effective peer reviewer.

So my checks are typically lint->test->other model->me, and relatively few things get to me in simple code. Contrived logic or maths, though, it needs to be all me.

user

20 days ago

[deleted]

kristianp

20 days ago

Note that the original FLUX.2 [klein] model [1] and python code was only released about 3 days ago (inexact without knowing the times and timezones involved.) Discussed at [2]

[1] https://bfl.ai/blog/flux2-klein-towards-interactive-visual-i...

[2] https://news.ycombinator.com/item?id=46653721

p1esk

20 days ago

I wonder how long it would have taken antirez without opus

jabedude

20 days ago

Salvatore, how did you pick up the requisite background knowledge on this subject? IIRC this is your first OSS project in the ML domain, just curious if/how much Claude was helpful with providing you with domain expertise while building this engine

antirez

20 days ago

Hello, I always used to play with AI. I wrote this, some time ago, just to make an example:

https://github.com/antirez/gguf-tools

And I have a YouTube channel mostly about AI (in Italian language) where I regularly post videos and often read papers that I then explain in the channel. I have a long time passion about AI, I wrote my first NN implementation in 2003 (used here, many years ago, as a showcase of Redis modules https://github.com/antirez/neural-redis), and never stopped since there to implement, for fun, small GPT models and things like that, using PyTorch or C.

Also my work at Redis Vector Sets, in the latest year, exposed me more to working with models (especially text embedding models of many kinds, but also other models).

So while Claude was fundamental to get the implementation fast, I had background to have idea about what was happening in the different stages. I believe it is a very interesting question to understand if this kind of work can be made with programming background and near-zero AI background. My feeling is that you ned more time, more back and forth, maybe to provide the agent with more examples, but eventually it will do something working.

dewarrn1

20 days ago

Ah, the other AI: Actual Italian. (w/apologies to Davie504)

antirez

20 days ago

lol I saw that video.

imranq

20 days ago

Just because it is in C, doesn't mean you will get C like performance. Just look at the benchmarks, it is 8x slower than just using PyTorch... while I get its cool to use LLMs to generate code at this level, getting super high performing optimized code is very much out of the domain of current frontier LLMs

jrk

20 days ago

The PyTorch version is using the GPU (with Metal Performance Shaders); this C version is currently using (in the docs I saw) a single CPU core, with AMX (via Apple Accelerate BLAS) but not yet with OpenMP for parallelism. It’s not slow because LLM code is bad, but because it’s not running on the same hardware. That said, it’s also not as fast as it is because of the LLM—all the critical code is in kernel libraries it calls (the same as for PyTorch).

antirez

20 days ago

Absolutely true, but now I'll focus on making it fast and I believe it will be possible to go much faster. I left the agent working in the night with a specification and now I'm going to see the progresses and restart the work.

nbardy

20 days ago

No it’s not. I have written cuda kernels and 8bit optimizers with this.

They’re actually very good at speed optimization and can iterate very quickly taking notes on trials and failures and benchmarks. I’ve had it write 10 different attempts in around an hour and benchmark them all then merge and beat very strong baselines in torch

antirez

19 days ago

Updates:

1. Now it is much faster, and the Python benchmarks were re-done correctly (the benchmark didn't account for model loading, and did warm-up before starting the actual inference, while the C code was tested exactly in the reverse way).

2. Now there is --mmap support to run on Linux with blas target with 16GB of RAM. Inference is viable on my old-ish Dell Latitude i5.

3. Seed now part of the PNG metadata.

4. Many other improvements, check the README.

throwaway2027

20 days ago

If I asked Claude to do the same can I also just put MIT license on it with my name? https://github.com/black-forest-labs/flux2 uses Apache License apparently. I know it doesn't matter that much and as long as it's permissive and openly available people don't care it's just pedantics but still.

antirez

20 days ago

The reference code shows how to setup the inference pipeline. It does not implement 99% of what the C code does. That is, the inference kernels, the transformer and so forth.

eikenberry

19 days ago

Assuming this was done in a US jurisdiction it doesn't matter what license you put on it as it is public domain and it needs no license. The US copyright office has ruled that anything AI generated is not covered by copyright.

jacquesm

19 days ago

Correction: it has ruled that anything AI generated is not copyrightable. That's a very important little difference and it does not mean that the production of the AI is not covered by copyright, it may well be (though proving that is going to be hard in most cases).

eikenberry

19 days ago

I'm not sure I see the difference. The rule is that anything not produced by a human is not copyrightable and is in the public domain. If something is not copyrightable and in the public domain how can it be covered by copyright?

jacquesm

19 days ago

> I'm not sure I see the difference.

The difference is massive because the source material is covered by copyright. So even if the product can't be copyrighted there is a fair chance that you'll get your ass sued by whoever is able to trace back some critical part of that product to their own work of which yours is now a derived work.

eikenberry

18 days ago

I'm talking about original, greenfield projects that was entirely written by an AI agent. There is no source material here beyond the agent and prompting. Prompting, AFAIK, hasn't been considered sufficient to make it a human produced work.

Or are you getting at the idea that the works the AI was originally trained on could still be considered an original work the generated code was derived from? Like if the generate code happens to look like someones code in github, that they could sue? I'm not 100% on sources here but I thought this was already tested in court and ruled it wasn't infringement.

netdur

20 days ago

i would love if you took the time to instruct claude to re-implement inference in c/c++, and put an mit license on it, it would be huge, but only if it actually works

badsectoracula

20 days ago

FWIW stable-diffusion.cpp[0] (which implements a lot more than just stable diffusion, despite the name) is already a MIT licensed C++ library.

[0] https://github.com/leejet/stable-diffusion.cpp/

csto12

20 days ago

As someone who doesn’t code in C and does more analytics work (SQL), is the code generated here “production grade?” One of the major criticisms I hear about llms is they tend to generate code that you wouldn’t want to maintain, is that the case here?

chrsw

20 days ago

It's not bad. Skimming the code I'd say it's not enterprise quality but it's definitely better than an amateur throwaway project.

keyle

20 days ago

Classic. non-enterprise C quality.

minimaxir

20 days ago

Those statements are mostly out of date and symptomatic of pre-agent-optimized LLMs. Opus 4.5 with clarifying rules in the CLAUDE.md does a good job at following idiomatic best practices in my experience.

That said, I'm mixed on agentic performance for data science work but it does a good job if you clearly give it the information it needs to solve the problem (e.g. for SQL, table schema and example data)

hirako2000

20 days ago

Not my experience. All frontier models I constantly test, agentic or not, produce code less maintainable than my (very good) peers and myself (on a decent day).

Plus they continue to introduce performance blunders.

Crying wolves, on day maybe there will be a wolf and I may be the last of us to check whether that's true.

scottmf

19 days ago

I independently did the same with an MLX implementation on Sunday (also with Claude Code).

I expected this C implementation to be notably faster, but my M3 Max (36GB) could barely make it past the first denoising step before OOMing (at 512x512)

Am I doing something wrong? The MLX implementation takes ~1/sec per step with the same model and dimensions: https://x.com/scottinallcaps/status/2013187218718753032

khimaros

20 days ago

https://github.com/leejet/stable-diffusion.cpp

abecedarius

20 days ago

A suggestion born of experience: besides printing the seed for an image, add it to the image file as metadata. Otherwise, if you're me, you'll lose it.

antirez

20 days ago

Good idea, noted in the TODO.

ThrowawayTestr

20 days ago

Comfyui does that automatically

peter_d_sherman

19 days ago

>"This program generates images from text prompts [...] using the [data from] FLUX.2-klein-4B [...] and is implemented entirely in C, with zero external dependencies beyond the C standard library."

You had me at image generation!

Pure C with zero external dependencies -- is just an extra added bonus...

No, actually, pure C with zero external dependencies -- is quite awesome in its own right!

Well done!

mmastrac

20 days ago

Is it just my connection or is the huggingface downloader completely broken? It was saturating my internet connection without making any progress whatsoever.

EDIT: https://github.com/bodaay/HuggingFaceModelDownloader seems to be making progress.

thasso

19 days ago

> I believe that inference systems not using the Python stack (which I do not appreciate) are a way to free open models usage and make AI more accessible.

What's wrong with the Python stack? I have never much used it or any other ML stack so I'm genuinely curious.

ChrisArchitect

20 days ago

FLUX.2 [Klein]: Towards Interactive Visual Intelligence

https://news.ycombinator.com/item?id=46653721

bakkoting

20 days ago

Neat! I wonder how slow this would be running in wasm. In my dream world it would also use WebGPU but that's a much bigger lift.

jitl

20 days ago

README says it’s optimized for Metal, if it really is using metal compute shader, apparently the programming model is fairly similar to WebGPU. You could try asking Claude to translate it :)

fulafel

20 days ago

Interesting that OpenBLAS and MPS are reportedly nearly the same speed although the README sounds like only MPS uses the GPU.

antirez

20 days ago

I think that this is because the current code does a terrible job at taking the activations in the GPU and fusing the kernels. This is the next thing to fix in this implementation indeed.

yunnpp

20 days ago

> I believe that inference systems not using the Python stack (which I do not appreciate) are a way to free open models usage and make AI more accessible.

What you're saying here is that you do not appreciate systems not using the Python stack, which I think is the opposite of what you wanted to say.

tomashubelbauer

20 days ago

I am an ESL speaker but I don't see why the sentence fragment in parentheses couldn't be parsed as relating only to "Python stack" as opposed to "systems not using the Python stack". I read it that way, but again, as an ESL speaker, I might be missing intuition or actual grammatical knowledge that would tick off a native speaker such as, presumably, yourself.

zipy124

20 days ago

It is based upon context, you are correct that it is ambiguious, as is the problem of most natural language.

-I believe that <inference systems not using the Python stack> (which I do not appreciate) are a way to free open models usage and make AI more accessible.

This reading of the text would lead one to believe they don't appreciate inferences systems not written in python. Given the inference system produced by the author is also not using the python stack (it is in C), we can assume this is not the correct reading.

-I believe that inference systems not using the <Python stack> (which I do not appreciate) are a way to free open models usage and make AI more accessible.

This reading says that the author does not like the python stack for inference, which given the author has produced this inference in C, would support the statement.

That is we have to take both readings and think which one fits the context around it, hopefully this helps :)

filipstrand

20 days ago

Really cool project! Impressive to see this being done in pure C.

I'm the maintainer of MFLUX (https://github.com/filipstrand/mflux) which does a similar thing, but at a higher level using the MLX framework optimised for Apple Silicon. I just merged Flux 2 Klein support as well and was happy to see this discussion :)

I started out doing this type of work roughly 1.5 years ago when FLUX.1 was released and have been doing it off and on ever since with newer models, trying to use more and more AI over time.

At one point, I vibe-coded a debugger to help the coding agent along. It worked OK but as models have gotten stronger, this doesn't really seem to be needed anymore. My latest version simply has a SKILL.md that outlines my overall porting strategy (https://github.com/filipstrand/mflux/blob/main/.cursor/skill...). Somewhat surprisingly, this actually works now with Cursor + Codex-5.2, with little human intervention.

> Even if the code was generated using AI, my help in steering towards the right design, implementation choices, and correctness has been vital during the development.

This definitely resonates! Curious to hear more about what worked/didn't for you. A couple of things I've found useful:

- Porting the pipeline backwards: This is the way I did it personally before using any coding models. The typical image generation flow is the following:

1.Text_encodings (+ random_noise_latent) 2.Transformer loop 3.VAE decoding

I found that by starting with the VAE first (by feeding it pre-loaded tensors from the reference extracted at specific locations) it was the quickest way to get to an actual generated image. Once the VAE is done and verified, only then proceed backwards the chain and handle the Transformer, etc. I still prefer to do it this way and I like to manually intervene between step 3,2 and 1, but maybe this won't actually be needed soon?

- Also, with the VAE, if you care about implementing the encoding functionality (e.g to be used with img2img features), the round-trip test is a very quick way to verify correctness:

image_in -> encode -> decode -> image_out : compare(image_in, image_out)

- Investing in a good foundation for weight handling, especially when doing repeat work across similar models. Earlier coding models would easily get confused about weight assignment, naming conventions etc. A lot of time could be wasted because weight assignment failed (sometimes silently) early on.

antirez

20 days ago

Thanks, super interesting remarks. Funny enough I also went the route of implementing it backward :D Initially I totally skipped the text encoding part. Used Python to generate binary blogs. Also the easy 1 step test with image output correctness that you can implement immediately, with fixed seed and max pixel difference bound to floating point approximation errors, is super useful to provide the model with an immediate correctness feedback after a change.

falloutx

20 days ago

I dont understand, so its just to generate the pic using a model. Isn't that trivial, whats the advantage of doing it in C? Is the model running in C? Readme is overly verbose and It seems like a project that just does one task and it costed the author $80.

antirez

20 days ago

Why it costed me $80? I pay the monthly subscription, and during that time I use it for multiple things. I spent one day of Claude Max on that so it costed me 2 euros and 60 cents.

falloutx

20 days ago

I may have misread that. Anyway, its 2.60 euros plus value of your time unless it was done with ralph agents.

antirez

20 days ago

It is worth it. There is value in making open frontier models simple to run / integrate.

user

20 days ago

[deleted]

gbalduzzi

20 days ago

Yes, the model runs in C, you just provide the model weights to the program.

The main advantage is that you don't need the python interpreter to run the program.

While not revolutionary, it is definitely not trivial and its main purpose is to demonstrate Claude code abilities in a low level, non trivial task.

lovasoa

20 days ago

The author of this project is also the author of redis. He knows what he is doing.

Running inference for a model, even when you have all the weights, is not trivial.

fabmilo

20 days ago

because of the principle: you only understand what you can create. You think you know something until you have to re-create it from scratch.

kurtis_reed

20 days ago

Confusing to have project name same as file name

antirez

20 days ago

This is to follow the naming of llama.cpp

reactordev

20 days ago

This is both awesome and scary. Yes, now we can embed image gen in things like game engines and photoshop or build our own apps. On the other hand, we can include image gen in anything…

nusl

20 days ago

This was possible before, though

rvz

20 days ago

Yes, it was always possible.

It's almost as if this is the first time many have seen something built in C with zero dependencies which makes this easily possible.

Since they are used to languages with package managers adding 30 package and including 50-100+ other dependencies just before the project is able to build.

snarfy

20 days ago

rip 1425

https://xkcd.com/1425/

treksis

20 days ago

how fast is this compare to python based?

antirez

20 days ago

Very slow currently, I added the benchmarks in the README. To go faster it needs to implement inference faster than the current float32-only kernels.

rcarmo

20 days ago

The Python libraries are themselves written in C/C++, so what this does performance-wise is, at best, cutting through some glue. Don't think about this as a performance-driven implementation.

throwaway314155

20 days ago

PyTorch MPS is about 10x faster per the README.md.

antirez

20 days ago

I cut the difference in speed by half by taking the activations on the GPU. Time to sleep but will continue tomorrow.

Numerlor

20 days ago

Have you tried e.g. Mojo that can vectorize/do SIMD without having to do intrinsics everywhere?

1vuio0pswjnm7

19 days ago

"I believe that inference systems not using the Python stack (which I do not appreciate) are a way to free open models usage and make AI more accessible."

holografix

20 days ago

No cuBLAS?

cboyardee

20 days ago

[dead]

DamianLewis

20 days ago

[dead]

llmidiot

20 days ago

[flagged]

antirez

20 days ago

One of the most important thing to do right now to redistribute something to the society, is to use AI to write free software: more free software than ever. If AI will be hard to access in the future, the more software it is released free, the better. If instead things go well (as I hope), there will be just a multiplication of the effect of OSS using today and tomorrow AI. In any way, writing free software using AI is a good idea, IMHO. I believe LLMs are the incarnation of software democratization, which aligns very well with why I used to write OSS. LLMs "steal" ideas, not verbatim code, you can force them to regurgitate some verbatim stuff, but most of it is ideas, and we humans also re-elaborate things we saw and we avoid (like LLMs are able to do) to emit the same stuff verbatim. But the software can't be patented for very good reasons, and LLMs capture all this value that is not subject to intellectual property, and provides it to the people that don't have the right tools and knowledge. And, it allows people that can code, to code 100x more.

hollowturtle

20 days ago

> redistribute something to the society

with a proprietary black box tool you pay a subscription for? that's nonsense

jitl

20 days ago

In Greek mythology Prometheus took fire from the gods and gave it to humans, for the low subscription fee of a liver a day.

airstrike

20 days ago

You can always run models locally? Local models will become cheaper and faster over time. Don't panic just yet

hollowturtle

20 days ago

Will this will that and never consider will not. As of now, observation made on evidence, it looks way more the latter to me.

dkdcio

20 days ago

this argument is nonsense…I write code on a macbook running macos. it’s not a subscription, but some people also pay a subscription for a proprietary IDE. so any FOSS written with proprietary paid software doesn’t count to you? only if it’s a subscription model?

hollowturtle

20 days ago

> I write code on a macbook running macos. it’s not a subscription

You already answered yourself, but let's pretend yours is a valid point: you lose access to Jetbrain IDE you can still code on another free ide/text editor and still give to society without heavily relying on ai somewhere in the cloud of the tech bros, which they don't want to give back to society, they want to be the gatekeepers of programming.

dkdcio

20 days ago

and you can switch AI providers, or use local LLMs. again, a nonsense point to raise about how FOSS is developed. coding “by hand” also doesn’t go away. if you lose your proprietary tools (laptop, OS, IDE, or coding agent) you can always work around it

williamcotton

20 days ago

> broad copyright violations

You’re probably not going to be very pleased with how this all plays out in court.

https://en.wikipedia.org/wiki/Idea%E2%80%93expression_distin...

https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...

https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...

LLMs work precisely because they converge on the abstract ideas in our languages. Anything verbatim is de minimus or more likely filtered out through the lens of the above doctrines.

simonw

20 days ago

Corporations love copyright. I don't see why "supports broad copyright violations" and "go fully corporate" fit together here.

Corey Doctorow wrote about that recently: https://pluralistic.net/2025/12/05/pop-that-bubble/#u-washin...

He's not a generative AI fan at all, but he argues that if artists think tightening up copyright law to make it harder to train models is a good idea they're going to find themselves strengthening the very corporate overlords that they've been trying to escape.

re

20 days ago

> I wanted to see if, with the assistance of modern AI, I could reproduce this work in a more concise way, from scratch, in a weekend.

I don't think it counts as recreating a project "from scratch" if the model that you're using was trained against it. Claude Opus 4.5 is aware of the stable-diffusion.cpp project and can answer some questions about it and its code-base (with mixed accuracy) with web search turned off.

antirez

20 days ago

The two projects have literally nothing in common. Not a line of code, not the approach, nor the design. Nothing. LLMs are not memorization machines that recall every project in the cut & paste terms you could think of.

user

20 days ago

[deleted]