Faster convergence for diffusion models

116 points, posted 15 hours ago
by vsroy

16 Comments

fxtentacle

7 hours ago

The title is not wrong, but it doesn't feel quite right either. What they do here is use a pre-trained model to guide the training of a second model. Of course that massively speeds up training of the second model. But it's not as if you can now train a diffusion model from scratch 20x faster. Instead, this is a technique for transplanting an existing model onto a different architecture so that you don't have to start training from zero.

pedrovhb

3 hours ago

It does feel right to me, because this isn't distillation from another generative model: the pre-trained model doing the guiding is not an image generation model at all, but a visual encoder. That is, it's a more "general purpose" model that specializes in extracting semantic information from images.

In hindsight it makes total sense: generative image models don't automatically start out with an idea of semantic meaning or of the world, so they have to implicitly learn one during training. That's a hard task by itself, and the model isn't specifically trained for it; rather, it picks it up on the go while the network learns to create images. The idea of the paper, then, is to give the diffusion model a preexisting concept of the world by nudging its internal representations to be similar to the visual encoder's. As I understand it, DINO isn't even used during inference once the model is ready; it's only there to shape the representations during training.
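For anyone curious what that nudging might look like in practice, here's a minimal sketch (not the paper's actual code; the projection layer, dimensions, and names are assumptions): project the diffusion transformer's intermediate token features into the encoder's feature space and penalize dissimilarity with the frozen encoder's features of the clean image.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder dimensions: 1152 for the diffusion transformer, 768 for the encoder.
proj = nn.Linear(1152, 768)  # learned projection into the encoder's feature space

def alignment_loss(hidden, enc_feats):
    """Auxiliary loss that nudges the diffusion model's intermediate token
    features (hidden) toward the frozen visual encoder's patch features
    (enc_feats). Both are (batch, num_tokens, dim) after projection."""
    h = F.normalize(proj(hidden), dim=-1)
    z = F.normalize(enc_feats, dim=-1)
    return -(h * z).sum(dim=-1).mean()  # negative cosine similarity
```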

I wouldn't at all describe it as "a technique for transplanting an existing model onto a different architecture". It's different from distillation because again, DINO isn't an image generation model at all. It's more like (very roughly simplifying for the sake of analogy) instead of teaching someone to cook from scratch, we're starting with a chef who already knows all about ingredients, flavors, and cooking techniques, but hasn't yet learned to create dishes. This chef would likely learn to create new recipes much faster and more effectively than someone starting from zero knowledge about food. It's different from telling them to just copy another chef's recipes.

psb217

a few seconds ago

The technique in this paper would still rightly be described as distillation. In this case it's distillation of "internal" representations rather than of the final prediction. This is a reasonably common form of distillation. The interesting observation in this paper is that including an auxiliary distillation loss based on features from a non-generative model can be beneficial when training a generative model. This observation leads to interesting questions, e.g. which parts of the overall task of generating images (diffusionly) are being learned faster/better due to this auxiliary distillation loss.
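To make the "auxiliary" part concrete, a rough sketch of how such a feature-level distillation term might be folded into the usual denoising objective (helper names like `add_noise`, the `return_hidden` interface, and the weight `lambda_align` are hypothetical, not taken from the paper):

```python
import torch
import torch.nn.functional as F

lambda_align = 0.5  # hypothetical weight on the auxiliary distillation term

def training_step(diffusion_model, frozen_encoder, projector, x0, t, noise):
    """One step: the usual denoising loss plus distillation of internal
    features toward a frozen, non-generative encoder (e.g. DINOv2)."""
    x_t = add_noise(x0, t, noise)                 # forward diffusion; assumed helper
    # Assumed interface: the model also returns an intermediate layer's token features.
    pred_noise, hidden = diffusion_model(x_t, t, return_hidden=True)
    with torch.no_grad():
        enc_feats = frozen_encoder(x0)            # features of the clean image

    loss_denoise = F.mse_loss(pred_noise, noise)  # generative objective
    h = F.normalize(projector(hidden), dim=-1)    # project into encoder space
    z = F.normalize(enc_feats, dim=-1)
    loss_align = -(h * z).sum(dim=-1).mean()      # negative cosine similarity
    return loss_denoise + lambda_align * loss_align
```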

byyoung3

6 hours ago

Yes, now it seems obvious, but before this it wasn't clear that it would speed things up, given that the pretrained model was trained on a separate objective. It's a brilliant idea that works amazingly well.

zaptrem

7 hours ago

Yeah, I wonder whether this still saves compute if you include the compute used to train DINOV2/whatever representation model you'd like to use?

cubefox

3 hours ago

That's the question. More precisely, how does the new method compare to the classical one in terms of training compute and inference compute?

gdiamos

7 hours ago

Still waiting for a competitive diffusion LLM.

kleiba

6 hours ago

Why?

WithinReason

6 hours ago

Diffusion works significantly better for images than sequential pixel generation, so there is a good chance it would work better for language as well.

Sequential generation was state of the art back in 2016, and it's basically how current LLMs work:

https://arxiv.org/abs/1601.06759
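For readers less familiar with the contrast: "sequential" here means autoregressive generation, one element at a time conditioned on everything so far, whether those elements are pixels (PixelRNN) or tokens (current LLMs). A toy sketch, where `model` is a placeholder that returns per-position logits:

```python
import torch

def sample_sequential(model, prefix, n_steps):
    """Autoregressive sampling: each new element (pixel value or token) is
    drawn from a distribution conditioned on everything generated so far."""
    seq = list(prefix)
    for _ in range(n_steps):
        x = torch.tensor(seq).unsqueeze(0)    # (1, len) context so far
        logits = model(x)                     # (1, len, vocab); placeholder model
        next_id = torch.distributions.Categorical(logits=logits[0, -1]).sample()
        seq.append(int(next_id))
    return seq
```

Diffusion, by contrast, produces all positions at once and refines them in parallel over a fixed number of denoising steps.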

magicalhippo

3 hours ago

I had similar thoughts to you.

However, diffusion models suck at details, like how many fingers are on a hand. And with language, words and characters matter: both which ones appear and where they appear.

So while I'm sure diffusion could produce walls of text that look convincingly like, say, a blog post at a glance, I'm not sure it would hold up to anyone actually reading it.

kleiba

5 hours ago

Neural LMs used to be based on recurrent architectures until the Transformer came along. That architecture is not recurrent.

I am not sure that a diffusion approach is all that suitable for generating language. Words are much more discrete than pixels.

WithinReason

5 hours ago

I meant sequential generation; I didn't mean using an RNN.

Diffusion doesn't work on pixels directly either; it works on a latent representation.

kleiba

4 hours ago

All NNs work on latent representations.

barrkel

3 hours ago

The contrast here is real: there are pixel-space diffusion models and latent-space diffusion models. Pixel-space diffusion is slower because there's more redundant information to denoise at every step.
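A toy illustration of why the latent route is cheaper (the conv layer here just stands in for the pretrained autoencoder a real latent diffusion model would use; shapes are illustrative):

```python
import torch
import torch.nn as nn

# Toy stand-in for the pretrained autoencoder used in latent diffusion.
toy_encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)

x = torch.randn(1, 3, 256, 256)   # pixel space: 3 * 256 * 256 = 196,608 values per image
z = toy_encoder(x)                # latent space: 4 * 32 * 32 = 4,096 values, ~48x fewer

# Pixel-space diffusion denoises tensors shaped like x at every step;
# latent-space diffusion denoises tensors shaped like z and only decodes
# back to pixels once at the end, which is why it's much cheaper per step.
print(x.numel(), z.numel())
```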

GaggiX

4 hours ago

I wonder how well this technique works when the training data distributions of the diffusion model and the image encoder are quite different, for example if you use DinoV2 as the encoder but train the diffusion model on anime.