vunderba
2 days ago
Updating the GenAI comparison website is starting to feel a bit Sisyphean with all the new models coming out lately, but the results are in for the Flux 2 Pro Editing model!
https://genai-showdown.specr.net/image-editing
It scored slightly higher than BFL's Kontext model, coming in around the middle of the pack at 6 / 12 points.
I’ll also be introducing an additional numerical metric soon, so we can add more nuance to how we evaluate model quality as they continue to improve.
If you're solely interested in seeing how Flux 2 Pro stacks up against Nano Banana Pro and another Black Forest model (Kontext), see here:
https://genai-showdown.specr.net/image-editing?models=km,nbp...
Note: It should be called out that BFL seems to support a more formalized JSON structure for more granular edits, so I'm wondering whether accuracy would improve using it.
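For illustration, something along these lines (a rough sketch only; the field names are my own guess, not BFL's documented schema):

    import json

    # Hypothetical structured edit request; the fields here are illustrative
    # and not necessarily what BFL's API actually expects.
    edit_request = {
        "edits": [
            {
                "target": "the surfer's wetsuit",
                "instruction": "change the color to bright red",
                "preserve": ["background", "lighting", "pose"],
            }
        ]
    }

    prompt = json.dumps(edit_request)  # passed as the edit prompt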
woolion
a day ago
The comparisons are very useful but also quite limited in terms of styles. Models vary wildly in how well they follow a given style versus steering toward their own.
It's pretty obvious that OpenAI is terrible at it -- it is known for its unmissable touch. However, for Flux it really depends on the style. They posted at some point that they changed their training to avoid averaging different styles together, which produces the ultimate "AI look". But this is at odds with the goal of directly generating images that are visually appealing, so style matching is going to be a problem for a while, at least.
vunderba
18 hours ago
The site is broken up into "Editing Comparison" and "Generative Comparison" sections.
Generative: https://genai-showdown.specr.net
Editing: https://genai-showdown.specr.net/image-editing
Style is mostly irrelevant for editing, since the goal is to integrate seamlessly with the existing image. The focus is on performing relatively surgical edits or modifications to existing imagery while minimizing changes to the rest of the image. It is also primarily concerned with realism, though there are some illustrative examples (the JAWS poster, Great Wave off Kanagawa).
This contrasts with the generative section, though even then the emphasis is on prompt adherence, and style/fidelity take a backseat (which is honestly what 99% of existing generative benchmarks already focus on).
woolion
15 hours ago
Oh, thank you for your reply. We may have different definitions of style and what editing would mean.
If you look, for example, at "Mermaid Disciplinary Committee", every single image is in a very different style, each of which you can consider the model's default assumption for that specific prompt. It's quite obvious that these styles were 'baked into' the models, and it's not clear how much you can steer toward a specific style. If you look at "The Yarrctic Circle", a lot more models default to a kind of "generic concept art" style (the "by greg rutkowski" meme), but even then I would classify the results as at least 5 distinct styles. So for me this benchmark is not checking style at all, unless you consider style to be just around 4 categories (cartoon, anime, realistic, painterly).
So regarding image editing: I did my own tests when the Flux tools were first released and found it was almost impossible to get any decent results in some specific styles, particularly cartoon and concept-art styles. I think the tools focus on what imaginary marketing people would want (like "put this can of sugary beverage into an idyllic scene") rather than such use cases. So edits like "color this" or other changes would just be terrible, and certainly unusable.
woolion
15 hours ago
I didn't go very far with my own benchmarks because my results were just so bad. But for example, here's a piece of line art with the instruction to color it (I can't remember the exact prompt; I didn't take notes).
https://woolion.art/assets/img/ai/ai_editing.webp
The images are, in order: original, ChatGPT, Flux.
Still, you can see that ChatGPT just throws everything out and makes no attempt at respecting the style. Flux is quite bad, but it follows the design much more closely (although it gets completely confused by it), so it seems that with a whole lot of work you could get something out of it.
vunderba
15 hours ago
Yeah so NOVEL style transfer without the use of a trained LoRA is, to my knowledge, still a relatively unsolved problem. Even in SOTA models like Nano Banana Pro, if you attach several images with a distinct artistic style that is outside of its training data and use a prompt such as:
"Using the attached images as stylistic references, create an image of X"
It falls down pretty hard.
woolion
13 hours ago
I'm pretty sure that some model at least advertised that it would work. I also think your example was in the training data at some point, at least, but I suspect these styles get kind of pruned when the models are steered towards "aesthetically pleasing" outputs, which are often used as benchmarks. Thanks for the replies, it's quite informative.
vunderba
10 hours ago
Sure! That image was pretty zoomed out, so I've gone ahead and attached some of the reference images in greater detail:
https://imgur.com/a/failed-style-transfer-nb-pro-o3htsKn
Now you should be able to see that the generated image is stylistically not even close to the references (which are early works by Yoichi Kotabe). Pay careful attention to the characters.
With locally hostable models, you can try things like Reference/Shuffle ControlNets but that's not always successful either.
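If you want to experiment with the Shuffle route locally, here's a minimal diffusers sketch (the checkpoints, prompt, and settings are just starting points I'd reach for, not a recipe):

    import torch
    from controlnet_aux import ContentShuffleDetector
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
    from diffusers.utils import load_image

    # The style reference gets "shuffled" so only its color/texture statistics
    # survive as conditioning, not its composition.
    reference = load_image("style_reference.png")
    control_image = ContentShuffleDetector()(reference)

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/control_v11e_sd15_shuffle", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    image = pipe(
        "a lighthouse on a cliff at sunset",
        image=control_image,
        num_inference_steps=30,
    ).images[0]
    image.save("styled_output.png")

As noted, results are hit and miss: the shuffle conditioning carries palette and texture reasonably well but rarely nails the actual linework of a distinct artist.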
spaceman_2020
a day ago
Clearly Google is winning this by some margin
Seedream is also very good and makes me think the next version will challenge Google for SOTA image gen
Increasingly feels like image gen is a solved problem
raxxorraxor
18 hours ago
I think the margin isn't that large, to be honest. If you compare the resources and data available to each, it's quite small and perhaps should be larger.
Also, it doesn't feel solved to me at all. There is no general model, and perhaps one cannot reasonably exist. I think these benchmark tests are smart, but they don't show the whole picture.
Domain-specific image generation tasks still require domain-specific models. For art purposes, SD1.5 with specialized, finely tuned checkpoints will still provide the best results by far. It is also limited, but I think it dampened the hype for new image generators significantly.
spunker540
16 hours ago
Does SD1.5 suffer from resolution / coherence / complexity issues?
I understand most outputs could be fine-tuned for most domains, but it still felt like SD1.5 had a resolution ceiling and a complexity ceiling, no matter how good the fine-tuning.
raxxorraxor
2 hours ago
Yes, the toolchains around it can alleviate that, but only to a degree. You're more or less dependent on a fine-tune specifically trained for the things you want. But if you have that, the image quality is usually far better than from any generic model, in my opinion, aside from resolution.
Merging any or all concepts is mostly beyond it, but I haven't seen any model that's good at that yet. There are some that are significantly better, but they often come with other disadvantages.
Overall, what these models can do is quite impressive. But if you want a really high-quality image, finding a fitting model is as difficult as finding the right prompt. And the general models tend to always fall back to some mean AI standard image.
vunderba
14 hours ago
Yeah, SD 1.5 was mostly trained on datasets at a resolution of 512x512. That's why you'd get crazy multi-limb goro abominations if you pushed checkpoints much higher than 768x768 without either using a Hires Fix or Img2Img.
There's not much of a reason to use SD 1.5 over SDXL if image quality is paramount.
A lot of people (myself included) use a pipeline that involves using Flux to get the basic action / image correct, then SDXL as a refiner and finally a decent NMKD-based upscaler.
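The middle refinement step looks roughly like this (a minimal diffusers sketch; the checkpoint, prompt, and strength are placeholders for whatever you actually use):

    import torch
    from diffusers import StableDiffusionXLImg2ImgPipeline
    from diffusers.utils import load_image

    # Base image already generated by Flux (composition / action locked in).
    base_image = load_image("flux_base.png").resize((1024, 1024))

    # SDXL as a refiner pass: a low strength keeps the composition intact
    # but lets SDXL re-render textures, skin, and fine detail.
    pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")

    refined = pipe(
        prompt="photorealistic, detailed skin texture, natural lighting",
        image=base_image,
        strength=0.35,            # how much SDXL is allowed to change
        num_inference_steps=30,
    ).images[0]
    refined.save("refined.png")

    # Final step: run refined.png through your upscaler of choice
    # (e.g. an NMKD ESRGAN model).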
ttul
20 hours ago
Prompt understanding will only ever be as good as the language embeddings that are fed into the model’s input. Google’s hardware can host massive models that will never be run on your desktop GPU. By contrast, Flux and its kin have to make do with relatively tiny LLMs (Qwen Image uses a 7B-param LLM).
bn-l
16 hours ago
Hey, I hope you see this. The scoring needs to be 0-10 or something with a range, rather than pass or fail. The Flux entry getting the same score for the surfer as Gemini 3 Pro reduces the quality of the benchmark.
vunderba
10 hours ago
Hi bn-l, yeah, as mentioned above and in the Release Notes, we'll be adding a more nuanced numerical score in the next week.
I don't know if I'm going to get as granular as 1-10, only because the finer the scoring, the more potential for subjectivity. That's why it was initially set up as a "Minimum Passing Criteria Rule Set" along with a Pass/Fail grade.
A suggestion from a previous HN post was something along the lines of (0 Fail, 0.5 Technical Pass, 1.0 Proficient Pass).
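In practice the tally would work something like this (a toy sketch of the tiered scheme, not the site's actual code):

    # Toy sketch of a three-tier score replacing a binary pass/fail.
    SCORES = {"fail": 0.0, "technical_pass": 0.5, "proficient_pass": 1.0}

    def model_total(results):
        """Sum tiered scores across all prompts for one model."""
        return sum(SCORES[outcome] for outcome in results)

    # e.g. a model that nails 5 prompts, scrapes by on 3, and fails 4:
    example = ["proficient_pass"] * 5 + ["technical_pass"] * 3 + ["fail"] * 4
    print(model_total(example))  # 6.5 out of a possible 12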
sroussey
18 hours ago
On the site: s/sttae/state/g
echelon
a day ago
How much energy does BFL have to keep playing this game against Google and ByteDance (SeeDream)?
If their new fancy model is only middle of the pack, and they're not as open source as the Chinese Qwen image models (or ByteDance / Alibaba / Lightricks video models), what's the point?
It's not just prompt adherence; the image quality of Flux models has been pretty bad. Plastic skin, inhumanly chiseled chins, that general faux "AI" aura.
Indeed, the Flux samples in your test suite that "pass" look god-awful. They might "pass" from a technical standpoint, but there's no way I'd choose Flux for my workflows. It looks bad.
(I wonder if they lack people on their data team with good aesthetic taste. It may be as simple as that.)
I think this company is struggling. They're pinned between Google and the Chinese. It's a tough, unenviable spot to be in.
I think a lot of the foundation model companies in media are having a really hard time: RunwayML, PikaLabs, LumaLabs. Some of them have pivoted hard away from solving media for everyone. I don't think they can beat the deep-pocketed hyperscalers or the Chinese ecosystem.
BFL just raised a massive round, so what do I know? I just can't help but feel that even though Runway raised similar money, they're struggling really hard now. And I would really not want to be fighting against Google who is already ahead in the game.
latentspacer
a day ago
I may be wrong, but it doesn't seem like BFL is struggling to me. They were apparently founded in August 2024 and have already signed $100M+ revenue deals with customers like Meta (https://www.bloomberg.com/news/articles/2025-09-09/meta-to-p...)
In fact, it seems like BFL has benefited a lot by becoming the go-to alternative for big enterprise customers who don't want to be dependent on Google.
Bombthecat
17 hours ago
Is the contract still running / will it still be going in 2026?
echelon
a day ago
Wow, I didn't hear about this. That's impressive, and kudos to the team.
That's why they raised the massive round, then.
But this just leads to more questions. I have to wonder if, and for how long, this is just going to plug a gap in Meta's own AI product offering. At some point they'll want to build their own in-house models, or perhaps just acquire BFL. Zuckerberg would not be printing AI data centers if that weren't the case.
From a PLG standpoint, Flux isn't really what graphic designers are choosing for their work. The generations look worse than OpenAI's "piss filter". But aesthetics might not be the play the team is going after.
Hopefully they don't just raise all of this dry powder and burn it trying to race Google. They should start listening to designers and get into their good graces if their intent is to build tools for art and graphic design work.
A good press release would consist of lots of good-looking images and a video of workflows that save artists time. This press release doesn't connect with graphic designers at all, and it reads as if they aren't even the audience.
If it's something else, more "enterprise", that BFL is after, then maybe I don't know the strategy or game plan.
latentspacer
a day ago
Idk, it seems pretty clear that BFL's target market is developers, not graphic designers. And for developers at scale like Meta and Adobe, it's pretty incredible that a tiny startup like BFL has become the primary alternative to Google with 1/100th of the resources, doing hundreds of millions in revenue within 12 months of their founding.
The Chinese models are great, but no serious enterprise developer is going to bet their image workloads at scale in production on Chinese models, if the market evolves anything like past developer infrastructure.
throwaway314155
17 hours ago
How is an image generation model serving the market of...developers? I mean I know we all focus on these models and get excited about what they can do. But why would we pay for them for more than a few tests?
rhdunn
a day ago
Reading the post, the architectural change is combining a vision model (Mistral 3 in the Flux.2 case) with a rectified flow transformer.
I wonder if this architectural change makes it easier to use other vision models such as the ones in Llama 3 and 4, or possibly a future Llama 5.
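For context, the rectified-flow half is conceptually pretty simple: the transformer is trained to predict a straight-line velocity between noise and image, conditioned on the text/vision embeddings. A toy sketch (not BFL's actual code; the model call signature is hypothetical):

    import torch
    import torch.nn.functional as F

    def rectified_flow_loss(model, x0, text_embeddings):
        """Toy rectified-flow objective: predict the constant velocity along
        the straight path x_t = (1 - t) * noise + t * x0."""
        noise = torch.randn_like(x0)
        t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)
        x_t = (1 - t) * noise + t * x0
        target_velocity = x0 - noise                    # constant along the path
        pred_velocity = model(x_t, t.flatten(), text_embeddings)
        return F.mse_loss(pred_velocity, target_velocity)

In principle, swapping the vision/language model would mostly mean retraining against a different embedding space, so the question of how easy that is seems to come down to how tightly the transformer is coupled to Mistral 3's embeddings.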
vunderba
a day ago
Sadly, I tend to agree. I'm rooting for BFL, but the results from this latest model (the Pro version, of all things) have just been a bit disappointing. Google’s release of NB Pro last week certainly didn’t help either, since it set the bar so incredibly high.
Flux 2 Pro only scored a single point higher than the Kontext models they released over half a year ago.
The text-to-image side was even more frustrating. It often felt like it was actively fighting me, as evidenced by the high number of re-rolls required before it passed some of the tests (Cubed⁵, for example).