Generative AI Image Editing Showdown

342 points, posted 3 months ago

(genai-showdown.specr.net)

82 Comments

minimaxir

3 months ago

Everyone is sleeping on Gemini 2.5 Flash Image / Nano Banana. As shown in the OP, it's substantially more powerful than most other models while at the same price-per-image, and due to its text encoder it can handle significantly larger and more nuanced prompts to get exactly what you want. I open-sourced a Python package for generating from it with examples (https://github.com/minimaxir/gemimg) and am currently working on a blog post with even more representative examples. Google also allows generations for free with aspect ratio control in AI Studio: https://aistudio.google.com/prompts/new_chat

That said, I am surprised Seedream 4.0 beat it in these tests.

daemonologist

3 months ago

I don't think people are really sleeping on it - nano-banana more or less went viral when it first came out. I'd argue that aside from the capabilities built into ChatGPT (with the Ghibli craze and whatnot), it's the best-known image editing model.

minimaxir

3 months ago

It's a weird situation where the Gemini mobile app hit #2 on the App Stores because of free Nano Banana, but no one ever talks about it and most disclosed image generations I've seen are still ChatGPT.

ec109685

3 months ago

Google photos should just include the feature. It’s kinda buried in Gemini.

Google is so weirdly non-integrated.

piquadrat

3 months ago

They announced that Nano Banana will be integrated in Google Photos a couple weeks ago.

https://blog.google/technology/ai/nano-banana-google-product...

troupo

3 months ago

> It’s kinda buried in Gemini.

> Google is so weirdly non-integrated.

Where by try gemini non-integrated have you tried gemini you mean gemini is here they shove use gemini gemini into every single product they have?

ec109685

3 months ago

It is terrible in all those services.

vunderba

3 months ago

> That said, I am surprised Seedream 4.0 beat it in these tests.

OP here. While Seedream did have the edge in adherence it also tends to introduce slight (but noticeable) color gradation changes. It's not a huge deal for me, but it might be for other people depending on their goals in which case NanoBanana would be the better choice.

cosama

3 months ago

I was trying to use gemini 2.5 flash image / nano banana to tidy up a picture of my messy kitchen. It failed horribly on my first attempt. I was quite surprised how much trouble it had with this simple task (similar to cleaning up the street in the post). On my second attempt I had it first analyze the image to point out all the items that clutter the space, and then on a second prompt had it remove all those items. That worked much better, showing how important prompt engineering is.
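A minimal sketch of that two-step flow, assuming a hypothetical `call_model(prompt, image)` function that wraps whichever image-editing API is in use and returns either text or image bytes. The function name, signature, and prompt wording are illustrative assumptions, not part of any real SDK.

```python
from typing import Callable, Union

# Hypothetical hook: takes a prompt and image bytes, returns text or image bytes.
ModelCall = Callable[[str, bytes], Union[str, bytes]]


def two_step_declutter(image: bytes, call_model: ModelCall) -> bytes:
    """Ask the model to list the clutter first, then remove exactly those items."""
    # Step 1: analysis only, no editing yet.
    clutter = call_model(
        "List every item that clutters this kitchen, one per line. "
        "Do not edit the image.",
        image,
    )
    if not isinstance(clutter, str):
        raise TypeError("expected a text response for the analysis step")

    # Drop bullet characters and blank lines from the model's list.
    stripped = (line.strip(" -•\t") for line in clutter.splitlines())
    items = [item for item in stripped if item]

    # Step 2: feed the explicit list back as the edit instruction.
    edit_prompt = (
        "Remove only these items and leave everything else unchanged: "
        + "; ".join(items)
    )
    edited = call_model(edit_prompt, image)
    if not isinstance(edited, bytes):
        raise TypeError("expected image bytes for the edit step")
    return edited
```

The point of splitting the calls is that the second prompt names concrete objects instead of asking the model to decide what counts as "messy".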

veunes

3 months ago

That actually proves how important the “number of attempts” metric is. It’s not just a “make everything pretty” button - it’s more like a powerful but slightly dumb intern who needs clear, step-by-step instructions. Your two-step approach really captures the essence of prompt engineering

vunderba

3 months ago

Yeah, that's part of the reason I list the number of attempts as part of the stats for each model + respective prompt. It's a loose metric of how "steerable" a given model is, or put another way, how much I had to fight with it before we were able to get it to follow the prompt directives.

herval

3 months ago

Gemini is great when it gets it right, but in my experience, it sometimes gives you completely unexpected results and won't get it right no matter what. You can see that in some of the examples (eg the Girl with the pearl earring one). I'm constantly surprised by how good Flux is, but the tragedy is most people (me included) will just default to whatever they normally use (chatgpt and gemini, in my case), so it doesn't really matter that it's better

tigershark

3 months ago

Flux Kontext quality is noticeably worse than nano banana, Qwen image 2509 and Seedream 4 most of the time. For pure image generation, though, Hunyuan image is scarily good.

dimitri-vs

3 months ago

Agreed, to the point where I built my own UI where I can simultaneously generate three images and see a before/after. Most often only one of three is what I actually wanted.
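A rough sketch of the "generate three at once" idea, assuming a hypothetical `edit(prompt, image, seed)` function that wraps the image API and returns edited image bytes. Only the standard-library thread pool is real here; the rest is a placeholder for whatever backend is used.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def generate_candidates(
    prompt: str,
    image: bytes,
    edit: Callable[[str, bytes, int], bytes],  # hypothetical API wrapper
    n: int = 3,
) -> List[bytes]:
    """Request n edits of the same image concurrently, one seed per request."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(edit, prompt, image, seed) for seed in range(n)]
        # Collect in submission order so results line up with their seeds.
        return [future.result() for future in futures]
```

The returned list can then be shown side by side with the original for a before/after pick.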

epiccoleman

3 months ago

half the time when i try to use nano banana, AI Studio fails, telling me it can't generate for some unspecified reason.

these aren't cases where I'm trying to do something that skirts the edge of copyright, either (like "Ghiblifying" images, for example).

that said, when it does work, it is super impressive.

minimaxir

3 months ago

Let's just say I've tested around this.

Copyright: Zero guardrails on anything related to third-party IP, which lets you do some funny things. (I'm including a picture/prompt of Super Mario, Mickey Mouse, and Bugs Bunny partying at a nightclub in the blog post)

Moderation: It has far fewer guardrails than any other Google AI product I've tried, and it is possible to prompt engineer some images that would definitely be considered NSFW by most people, more NSFW than actual NSFW image generators (a post-generation filter will catch most nudity, however). I have not had any rejections for more innocuous queries that could be misinterpreted as being NSFW.

vunderba

3 months ago

It might be the safety moderation system. It's rather aggressive and when it does kick in (at least in the API), it often returns an empty response giving basically zero indication as to the root cause.

minimaxir

3 months ago

The empty response issue is annoying since there is already a PROHIBITED_CONTENT flag, but it is not used in this case.

BoorishBears

3 months ago

No one is sleeping on nano-banana/Gemini Flash, it's highly over-tuned for editing vs novel generation and maxes out at a pretty low resolution.

Seedream 4.0 is somewhat slept on for being 4k at the same cost as nano-banana. It's not as great at perfect 1:1 edits, but its aesthetics are much better and it's significantly more reliable in production for me.

Models with LLM backbones/omni-modal models are not rare anymore, even Qwen Image Edit is out there for open-weights.

veunes

3 months ago

Gemini likely has a more powerful text encoder, which is why it's better at parsing complex, nuanced prompts. Seedream, on the other hand, might have a more advanced diffusion U-Net architecture that's better at preserving textures and handling local edits. One model understands better, the other draws better

tigershark

3 months ago

Seedream 4 is better than nano banana on average, so that test result seems accurate to me

franze

3 months ago

honest question: where is / how to do aspect ratio control for nano banana in aistudio?

minimaxir

3 months ago

It's on the right sidebar if Nano Banana is selected.

cpursley

3 months ago

Meh, most Google AI products look great on paper but fail in actual real scenarios. And that ranges from their Claude Code clone to their buggy storybook thing which I really wanted to like.

lxe

3 months ago

This is vastly more useful than benchmark charts.

I've been using Nano Banana quite a lot, and I know that it absolutely struggles at exterior architecture and landscaping. Getting it to add or remove things like curbs, walkways, gutters, etc, or to ask to match colors is almost futile.

estetlinus

3 months ago

I am trying Qwen Image Edit for turning day photos into night, mostly architecture etc. Most models are struggling, and Nano Banana misses edges and stuff, making the pictures align poorly.

roenxi

3 months ago

It is fun being one of the elderly who set their standards back in distant 2022. All these demos look incredible compared to SD1, 2 & 3. We've entered a very different era where the models seem to actually understand both the prompt and the image instead of throwing paint at the wall in a statistically interesting manner.

I think this was fairly predictable, but as engineering improvements keep happening and the prompt adherence rate tightens up we're enjoying a wild era of unleashed creativity.

zamadatix

3 months ago

I still feel that varying the prompt text, the number of tries, and the strictness, combined with only showing the most-liked result, dilutes most of the value in these tests. It would be better if there was one prompt 8/10 human editors understood and implemented correctly and then every model got 5 generation attempts with that exact prompt on different seeds or something. If it were about "who can create the best image with a given model" then I'd see it more, but most of it seems aimed at preventing that sort of thing and it ends up in an awkward middle zone.

E.g. Gemini 2.5 Flash is given extreme leeway with how much it edits the image and changes the style in "Girl with Pearl Earring" only to have OpenAI gpt-image-1 do a (comparatively) much better job yet still be declared failed after 8 attempts, while having been given fewer attempts than Seedream 4 (passed) and less than half the attempts of OmniGen2 (which still looks way farther off in comparison).
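The fixed-prompt protocol proposed above could be sketched roughly as follows. Both `generate(model, prompt, seed)` and `judge(image)` are hypothetical hooks: the first stands in for a model API call, the second for a human pass/fail decision. The default of five seeds comes from the comment; everything else is an assumption.

```python
from typing import Callable, Dict, Iterable

# Hypothetical hooks: `generate` calls a model with a fixed seed,
# `judge` stands in for a human pass/fail decision on the result.
Generate = Callable[[str, str, int], bytes]
Judge = Callable[[bytes], bool]


def pass_rates(
    models: Iterable[str],
    prompt: str,
    generate: Generate,
    judge: Judge,
    seeds: Iterable[int] = (0, 1, 2, 3, 4),
) -> Dict[str, float]:
    """Run one fixed prompt on every model with the same seeds.

    Returns the fraction of attempts that passed, per model.
    """
    seed_list = list(seeds)
    rates: Dict[str, float] = {}
    for model in models:
        passes = sum(judge(generate(model, prompt, seed)) for seed in seed_list)
        rates[model] = passes / len(seed_list)
    return rates
```

Reporting a pass fraction per model, rather than the single best image, keeps the number of attempts and the prompt identical across models.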

cttet

3 months ago

A "worst image" instead of best image competition may be easy to implement and quite indicative of which model gives the less frustrating experience.

vunderba

3 months ago

OP here. That's kind of the idea of listing the number of attempts alongside failure/successes. It's a loose metric for how "compliant" a model is - e.g. how much work you have to put in to get a nominally successful result.

zamadatix

3 months ago

The OpenAI gpt-image-1 example was supposed to be noted as for the "You Only Move Twice" test.

hackthemack

3 months ago

I do not use ai image generating much lately. It seemed like there was a burst of activity a year and a half ago with self hosted models and using some localhost web guis. But now it seems like it is moving more and more to online hosted models.

Still, to my eye, ai generated images feel a bit off when dealing with real world photographs.

George's hair, for example, looks over the top, or brushed on.

The tree added to the sleeping person on the ground photo... the tree looks plastic or too homogenized.

minimaxir

3 months ago

> But now it seems like it is moving more and more to online hosted models.

It's mostly because image model size and required compute for both training and inference have grown faster than self-hosted compute capability for hobbyists. Sure, you can run Flux Kontext locally, but if you have to use a heavily quantized model and wait forever for the generation to actually run, the economics are harder to justify. That's not counting the "you can generate images from ChatGPT for free" factor.

> George's hair, for example, looks over the top, or brushed on.

IMO, the judge was being too generous with the passes for that test. The only one that really passes is Gemini 2.5 Flash Image:

Flux Kontext: In addition to the hair looking too slick, it does not match the VHS-esque color grading of the image.

Qwen-Image-Edit: The hair is too slick and the sharpness/saturation of the face unnecessarily increases.

Seedream 4: Color grading of the entire image changes, which is the case with most of the Seedream 4 edits shown in this post, and why I don't like it.

janalsncm

3 months ago

For 99% of my use cases I’ll just use ChatGPT or Gemini due to convenience. But if you want something with a specific style, Flux LoRAs are much better, in which case I’ll boot up the old 4090.

The economics 1000% do not justify me owning a GPU to do this. I just happen to own one.

veunes

3 months ago

I think fine-tuning could fix that problem

If you take a base model and train it on a hundred Seinfeld frames, it would pick up the specific style - the color grading, grain, lighting - and it would add the hair way more naturally

jimmyl02

3 months ago

I think reve (https://reve.com) should be in the running and would be very curious to see the results!

achow

3 months ago

Thank you for the pointer. I was struggling with Nanobanana for editing an image which it had created earlier, but Reve gave me the edit result exactly the way I wanted in the first pass.

My usecase: An image of a cartoon character, holding an object and looking at it. Wanted to edit so that the character no longer has the object in her hand and is now looking towards the camera.

Result Nanobanana: At first pass it only removed the object that the character was holding, however there was no change in her eyeline, she was still looking down at her now empty hand. Second prompt explicitly asked to change the eyeline to look at camera. Unsuccessful. Third attempt asked the character to look towards ceiling. Success but unusable edit as I wanted the character to look at the camera.

Result Reve: At first attempt it gave me 4 options and all 4 are usable. It not only removed the object and changed the eyeline of the character to look at the camera, but it also made posture changes so that the empty hands were appropriately positioned, and now since the character is in a different situation (sans the object that was holding her attention) Reve posed the character in different ways which were very appropriate - which I didn't think of prompting for earlier (maybe because my focus was on immediate need - object removal and change in eyeline).

With a little more digging I found this writeup, which will make me sign up for their product.

https://blog.reve.com/posts/reve-editing-model/

vunderba

3 months ago

OP here. Thanks for the recommendation. I'll check it out and try to get them added!

ImHereToVote

3 months ago

Thanks for the tip.

shridharathi

3 months ago

Here's a post I wrote on the Replicate blog putting these image editing models head-to-head. Generally, I found Qwen Image Edit to be the cheapest and fastest model that was also quite capable of most image editing tasks.

If I were to make an image editing app, this would be the model I'd choose.

https://replicate.com/blog/compare-image-editing-models

silisili

3 months ago

Neat comparison. The only qualm I have is giving a pass on that last giraffe... it's not visibly any shorter, just bent awkwardly.

Even so, Gemini would lose by 1, but I found that I would often choose it as the winner (especially, say, The Wave surfer). Would love to see an x/10 instead of pass/fail.

vunderba

3 months ago

Yeah that's a fair critique. Your description made me laugh. Can't wait to go to a zoo exhibit featuring "AWKWARDLY BENT GIRAFFE".

joomla199

3 months ago

Good effort, somewhat marred by poor prompting. Passing in “the tower in the image is leaning to the right,” for example, is a big mistake. That context is already in the image, and passing that as a prompt will only make the model apt to lean the tower in the result.

vunderba

3 months ago

I should have been more clear. Those are NOT the direct prompts. They are the starter prompts. In fact that's why the attempt numbers change, we adapt the exact prompts depending on the model.

joomla199

3 months ago

I understood that much, at least from the description you added on the Kontext result. I agree that you should provide more information here, though, especially around "we adapt the exact prompts depending on the model", since your strategy here could also reflect model strengths and weaknesses.

vunderba

3 months ago

Good point! Perhaps I should add in the "final model-specific prompt", or place them in an errata section.

joomla199

3 months ago

By the way, this is what I got from Kontext after just a couple of tries: https://i.imgur.com/J4LwkVI.png

Prompt: "Keeping the glass and the hand behind the glass the same, please change only the three brown candies in the glass into green, yellow, red, and orange candies. Make no other changes. Change the reflection to remove the brown candy too." Seed was 1070229954903864, but your setup is probably too different for that to help.

It seems like Gemini 2.5 Flash was the only model that successfully removed the reflections...it should get some points for that!

keyle

3 months ago

This was fun.

Some might critique the prompts and say this or that would have done better, but they were the kind of prompt your dad would type in not knowing how to push the right buttons.

vunderba

3 months ago

OP here. You're the second person to say this. I cut my teeth on SD 1.5 - so I'm rather intimately familiar (for better or worse) with the level of prompt craft necessary depending on the model.

I feel like the FAQ section isn't displayed prominently enough:

How are the prompts written?

  In addition to giving models several attempts to generate an image, we also write several variations of the prompt to ensure that models don't get stuck on certain keywords or phrases depending on their training data. For example, while hippity hop is a relatively common name for the ball riding toy, it is also known as a space hopper. We try to use both terms in the prompts to ensure that models are not biased towards one or the other.

  Prompts for Hunyuan were attempted in both Chinese and English with and without Image Optimization.

Additionally when you see a prompt like "Turn on the lights" - the idea is to actually go beyond direct prompting commands - we're actually probing the capabilities of a truly multimodal LLM. It's a prompt that would spectacularly fail in more traditional models (such as SDXL).

seany

3 months ago

Is there anything like this comparison for nsfw images? I'm married to a boudoir photographer who sometimes wants to use ai tools for things, and they are all _awful_ if there is nudity in the photos. It's like some sort of neo puritanism has taken over.

tpierce89

3 months ago

I also do similar work and have run tests on many models. I have listed a few here with sample images using one prompt with a single run. I know it isn't a comprehensive review like OP, but it's something. My personal preference through experience is epicRealismXL.

https://imgchest.com/p/xny8e23jpyb

seany

3 months ago

Thanks for the tip. Need to see how well these work for inpainting

veunes

3 months ago

This is so much more useful than synthetic benchmarks. The most important column here isn't pass/fail, it's attempts. In production a model that gets it right in 2 attempts is 10x more valuable than one that needs 20 iterations of prompt engineering. It's a direct measure of cost and predictability.

Seedream 4 won on points, but Gemini seems more steerable and required less fighting on many of the tasks

kgwgk

3 months ago

Recent discussion: https://news.ycombinator.com/item?id=45708795

jumploops

3 months ago

Nit: the link there was `Text-to-Image` while this is `Image Editing`

Still useful comments, as the models mostly overlap

ineedasername

3 months ago

Kontext is very good. Get yourself a 5060 ti 16GB and never have to pay for API calls again for this purpose, at least not when you have the time to spare. If you need this sort of editing at the speed of gui-clicking + 10s, then you'll need to pay API tolls, or capex for > 5070/80.

zamadatix

3 months ago

You have to REALLY be into AI to do this for generation/API cost reasons (or willing to have this as a hacking project of the month expense). Even ignoring electricity, a 16 GB 5060 Ti is more expensive than 16,000 image generations. Assuming you do one every 15 seconds, that's 240,000 seconds -> more than 2 months of usage at an hour a day of generations.
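The break-even arithmetic above, written out as a quick check. All three inputs are the figures quoted in the comment (16,000 generations, 15 seconds each, one hour of use per day); none of them are measured values.

```python
generations = 16_000          # figure quoted in the comment
seconds_per_generation = 15   # assumed pace from the comment
hours_per_day = 1             # assumed daily usage from the comment

total_seconds = generations * seconds_per_generation
total_hours = total_seconds / 3600
days_to_break_even = total_hours / hours_per_day

print(total_seconds)                 # 240000
print(round(total_hours, 1))         # 66.7
print(round(days_to_break_even, 1))  # 66.7
```

Roughly 67 days at an hour a day is where the "more than 2 months" estimate comes from.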

If you've already got a decent GPU (or were going to get one anyways) then cost isn't really a consideration, it's just that you can already do it. For everyone else, you can probably get by just using things like Google's AI Studio for free.

weberer

3 months ago

>a 16 GB 5060 Ti is more expensive than 16,000 image generations

Sure, but now you get a good gaming GPU that you can write off as a business expense.

ineedasername

3 months ago

16,000? Where are you buying your GPU, or API calls? If you don’t want to wait for a bargain then $450 will get you the GPU, and even at that price you’d only be able to buy about 10,000 standard-resolution image gen api calls. Do you do design? Editing? Touch up? You can easily blow through a few hundred api calls an hour: “Turn the stitching green… slightly less saturated… now make the stitches more ragged… a little more… now just slightly less”.

Clearly you’re looking at the task through the eyes of a hobbyist or “of the month” project so the workflow and pace may not be obvious but API budgets spend fast. Just look at the benchmarks in this article to see how many tries some of these changes took: 47, there goes $3 in 3 minutes, or half that time if you’re quick on the keyboard.

And even then! Well, you’re limited aren’t you? Limited to the Gemini model, or OpenAI, or whoever, and you see the limits of any one model in the article as well. Or you plonk down for a mediocre GPU with some slight VRAM headroom and choose from dozens of models, countless LoRAs, control nets, and other options, with infinitely flexible inpainting and outpainting. Ahead of that you’ll need to budget at least a dozen hours to learn local genai tools, comfyui or others. Then, for under $1 in electricity, you can queue up a dozen ideas overnight and get 1,000 variations on each of them handed to you in the morning to quickly triage over coffee and email catchup.

It’s not a one size fits all market though, and most professionals are likely finding they want both: A low-cost, high-control, high precision sandbox that isn’t as fast or scalable as the api, and the api for when fast and scalable is what you need.

spookie

3 months ago

GPUs are needed for plenty of reasons. I assume plenty have a decent dGPU, even on laptops.

joomla199

3 months ago

I have a 4080 RTX and Kontext runs great at fp8. I run several other models besides. If you want to get at all good at this, you need tons of throwaway generations and fast iteration and an API quickly becomes pricier than a GPU.

ineedasername

3 months ago

Precisely. Even if the inflated 16,000 api calls figure were accurate for what the cost of a mediocre GPU would get you, that’s not an endless store of api calls. I’m also on a 4080 for lighter loads, and even just writing benchmarks, exploring attention mechanisms, token salience, etc, without image gen being my specific purpose I may trash half a thousand generations from output every few days. More if I count the stuff that never made it that far too.

zamadatix

3 months ago

The point is just having a "decent" dGPU isn't enough. Even at 16 GB you're already quantizing Flux pretty heavily, someone with a 4080 gaming laptop is going to be disappointed trying to work with 12 GB.

lschueller

3 months ago

I wonder how much longer those annoying stock photo databases will continue. They are great for press photography and such. But stock pics of people in offices for a website are nothing I would buy a minimum 3-month subscription for anymore.

delichon

3 months ago

As generative AI eats away at the high royalty, restrictive license, consent evading, stereotype reinforcing business model of stock photo companies, it will be a challenge to resist the schadenfreude.

CobrastanJorji

3 months ago

I'm pretty sure that "replace the homeless man with a park bench" image was a reference to some TV show making a gentrification joke, but I can't put my finger on it. Anyone recall?

vunderba

3 months ago

Yeah, I couldn't help myself on that one! It's a reference to the Cypress Creek promotional video from the Simpsons.

https://www.youtube.com/watch?v=foU9W7AkKSY

pram

3 months ago

Simpsons, the Hank Scorpio episode. The advertisement for the company town shows a beggar slowly fading out and being replaced by a mailbox.

dev2roofer

3 months ago

Yeah, it’s kinda crazy how fast this stuff leveled up. A year ago we were happy if hands looked normal — now we’re nitpicking shadows and curb textures. Wild times.
