New AI diffusion model approach solves the aspect ratio problem

51 points, posted a year ago
by gmays

21 Comments

GaggiX

a year ago

The aspect ratio problem was solved by NovelAI when they trained SD v1.4 on images with different aspect ratios using a technique they call "Aspect Ratio Bucketing", and after that it became commonly used in the final stage of training.

https://blog.novelai.net/novelai-improvements-on-stable-diff...
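
A minimal sketch of the bucketing idea, assuming hypothetical bucket resolutions (not NovelAI's actual code): images are grouped into fixed-resolution "buckets" of similar aspect ratio, so each training batch holds images of one shape and little has to be cropped away.

```python
# Candidate bucket resolutions, all near a 512x512 pixel budget (assumed values).
BUCKETS = [(256, 1024), (320, 832), (384, 704), (448, 576),
           (512, 512), (576, 448), (704, 384), (832, 320), (1024, 256)]

def assign_bucket(width, height):
    """Pick the bucket whose aspect ratio is closest to the image's."""
    ratio = width / height
    return min(BUCKETS, key=lambda b: abs(b[0] / b[1] - ratio))

def group_by_bucket(sizes):
    """Group image sizes so each training batch can be drawn from one bucket."""
    groups = {}
    for w, h in sizes:
        groups.setdefault(assign_bucket(w, h), []).append((w, h))
    return groups
```

At train time you then resize each image to its bucket's resolution and sample batches bucket by bucket, so no batch mixes shapes.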

gwern

a year ago

It was also solved by the even easier approach of aspect ratio conditioning, where you just pass in the dimensions of the crop to the NN like SDXL: https://arxiv.org/pdf/2307.01952#page=3&org=stability
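
A rough sketch of that conditioning idea, with assumed embedding dimensions (not SDXL's actual implementation): the original image size and crop offsets are turned into sinusoidal embeddings and folded into the model's timestep embedding, so the network "knows" how each training image was resized and cropped.

```python
import math

def sinusoidal_embed(value, dim=8):
    """Embed one scalar (e.g. image height) as sin/cos features."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + \
           [math.cos(value * f) for f in freqs]

def size_crop_conditioning(orig_h, orig_w, crop_top, crop_left, dim=8):
    """Concatenate embeddings of original size and crop offsets into one vector,
    which would be added to the diffusion model's timestep embedding."""
    vec = []
    for v in (orig_h, orig_w, crop_top, crop_left):
        vec.extend(sinusoidal_embed(v, dim))
    return vec  # length 4 * dim
```

At sampling time you pass the target size with zero crop offsets, steering the model toward uncropped, correctly framed output.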

GaggiX

a year ago

How does this replace "Aspect Ratio Bucketing"? Are they padding the smaller images and masking the attention?

whywhywhywhy

a year ago

Nothing here seems that impressive, and none of the ratios shown deviate much from what anything post-SDXL can do anyway.

I might have been impressed if some extreme letterbox or vertical-banner extreme portrait had been shown, but everything shown here works fine in SDXL and especially Flux, and the cat image doesn't even feature a press conference or journalists.

bee_rider

a year ago

The two images shown in the article using the new method are sort of… stylized or slightly cartoonish in a way that the images generated without their method are not. Their images also have a "perfectly framed, looking straight at the camera" quality, which looks a little artificial. The images not using their method have a more natural look (although, obviously, they have the issue with the duplicated subject).

I wonder if it is an unavoidable result of their method, or if it is just a little issue (of course it is hard to get infinite compute as an academic, maybe they just need to train more. Is that a thing? I don’t AI).

BugsJustFindMe

a year ago

Cartoonish output is a problem across the board. If you explicitly ask Dall-E for a "photograph" of something, you will very often get a result that looks like a cartoonified illustration. Prompt writers resort to specifying exact camera models and lenses to try to constrain the process.

adamanonymous

a year ago

There are fine-tuned models out there that can generate near-photorealistic results. The base SD models and those offered by the major AI service sites have a more stylized look to them. That's probably partly so they work on a wider array of prompts that may include non-photorealistic subjects, and partly for safety.

refulgentis

a year ago

The problem as described was solved eons ago. I'm honestly struggling to remember when this was an issue. Certainly pre SD 1.5, maybe 2021?

I assume something got lost in translation to PR.

NBJack

a year ago

1.5 still has this issue, particularly with specific subjects (e.g. the owl), any time you step significantly beyond the stock resolution (e.g. 1024x512). SDXL, while more stable, can also suffer from this.

The trouble is really the "window" in which the model operates.

refulgentis

a year ago

I use 1.5 hundreds of times a day outside this resolution, so it must be the subjects I'm using. Now that you mention it, SDXL was awful at it.

isoprophlex

a year ago

That's academia for you...

refulgentis

a year ago

I was being polite and shading toward the common interpretation HN has of academic PR; the article actually contains a quite lengthy technical description.

notum

a year ago

Just using "cropped" as a negative prompt eliminates this issue entirely on my end and produces the same results as their owl example in SDXL.
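
For illustration, a hedged sketch of how that looks with the Hugging Face diffusers library (model id and prompt strings are assumptions, not from the thread):

```python
# Hypothetical negative-prompt setup for non-square sampling.
NEGATIVE_PROMPT = "cropped, out of frame, duplicated"

def build_call_kwargs(prompt, width=1024, height=512):
    """Arguments for a diffusers pipeline call at a wide aspect ratio."""
    return dict(prompt=prompt, negative_prompt=NEGATIVE_PROMPT,
                width=width, height=height)

# Usage (requires `pip install diffusers torch` and downloaded weights):
# from diffusers import StableDiffusionXLPipeline
# pipe = StableDiffusionXLPipeline.from_pretrained(
#     "stabilityai/stable-diffusion-xl-base-1.0")
# image = pipe(**build_call_kwargs("an owl on a branch")).images[0]
```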

mattstir

a year ago

The original paper [0] this article is based on raises a few questions for me. It compares the authors' new technique against StableDiffusion but fails to specify which version of SD they're using for that comparison. It doesn't mention how example outputs were chosen (were they cherry-picked?). For non-square images, they seem to have specifically chosen resolutions that the other models weren't trained to output (e.g., 384 x 512) without also including ones that they were trained on (e.g., 896 x 1152). I wonder how this new technique would compare with all of that accounted for.

[0] https://openaccess.thecvf.com/content/CVPR2024/papers/Haji-A...

bongodongobob

a year ago

What aspect ratio problem? I played around with my midjourney account just now and it flawlessly works with extreme aspect ratios.

mhog_hn

a year ago

Any diffusion models out there that work well for generating stylized graphics? Think the stuff on your typical SaaS website.
