djoldman
a year ago
This is one in a long line of posts saying "we took a model and made it smaller" so that it can now run under lower hardware requirements.
It is important to keep in mind that modifying a model changes the performance of the resulting model, where performance is "correctness" or "quality" of output.
Just because the base model is very performant does not mean the smaller model is.
This means that another model that is the same size as the new quantized model may outperform the quantized model.
Suppose there are equal sized big models A and B with their smaller quantized variants a and b. A being a more performant model than B does not guarantee a being more performant than b.
ttul
a year ago
While I think I agree that there are many posts here on HackerNews announcing new model compression techniques, your characterization above understates the technical innovations and practical impact described in this MIT paper.
Unlike traditional model compression work that simply applies existing techniques, SVDQuant synthesizes several ideas in a comprehensive new approach to model quantization:
- Developing an outlier absorption mechanism using low-rank decomposition — this aspect alone seems quite novel, although the math is admittedly way beyond my level (rough sketch after this list)
- Combining SVD with smoothing in a way that specifically addresses the unique challenges of diffusion models
- Creating an innovative kernel fusion technique (they call it “Nunchaku”) that makes the theoretical benefits practically realizable, because without this, the extra computation required to implement the above steps would simply slow the model back down to baseline
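If I'm following the gist, the low-rank part is something like this toy sketch (definitely not the authors' actual code: the function name `split_weight`, the `rank` and `n_bits` defaults, and the naive per-tensor scale are all my own stand-ins, and I've left out the smoothing step entirely):

    import torch

    def split_weight(W: torch.Tensor, rank: int = 32, n_bits: int = 4):
        """Toy low-rank-plus-quantized split (names are mine, not the paper's)."""
        # 1. Keep the top singular directions in 16-bit: they absorb the
        #    large-magnitude structure ("outliers") that 4-bit can't represent.
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        L1 = U[:, :rank] * S[:rank]     # (out_features, rank)
        L2 = Vh[:rank, :]               # (rank, in_features)

        # 2. Quantize only the residual, which now has a much smaller range.
        R = W - L1 @ L2
        qmax = 2 ** (n_bits - 1) - 1
        scale = R.abs().max() / qmax
        R_q = torch.clamp((R / scale).round(), -qmax - 1, qmax).to(torch.int8)

        return L1, L2, R_q, scale

    # e.g. L1, L2, R_q, scale = split_weight(torch.randn(4096, 4096))

The low-rank branch stays in 16-bit, which is why the fused kernel below matters: without it you'd pay for an extra matmul per layer.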
This isn't just incremental improvement - the paper achieves several breakthrough results:
- First successful 4-bit quantization of both weights AND activations for diffusion models
- 3.5x memory reduction for 12B parameter models while maintaining image quality
- 3.0x speedup over existing 4-bit weight-only quantization approaches
- Enables running 12B parameter models on consumer GPUs that previously couldn't handle them
And, I’ll add, as someone who has been following the diffusion space quite actively for the last two years, the amount of creativity that can be unleashed when models are accessible to people with consumer GPUs is nothing short of astonishing.
The authors took pains to validate their approach by testing it against three models (Flux, PixArt-Sigma, and SDXL) and along several quality-comparison axes (FID score, Image Reward, LPIPS, and PSNR). They also did a proper ablation study to see the contribution of each component in their approach to image quality.
What particularly excites me about this paper is not the ability to run a model that eats 22GB of VRAM in just 7GB. The exciting thing is the prospect of running a 60GB model in 20GB of VRAM. I’m not sure whether anyone has or is planning to train such a monster, but I suspect that Midjourney, OpenAI, and Google all have significantly larger models running in their infrastructure than what can be run on consumer hardware. The more dimensions you can throw at image and video generation, the better things get.
djoldman
a year ago
I definitely agree that there may be some interesting advancements here.
I am trying to call attention to the models used for evaluation comparison. There are 3 factors: inference speed/latency, model size in total loaded VRAM, and model performance in terms of output.
Comparisons should address all of these considerations, otherwise it's easy to hide deficiencies.
Jackson__
a year ago
The site literally has a quick visual comparison near the top, which shows that theirs is the closest to 16-bit performance compared to the others. I don't get what more you'd want.
https://cdn.prod.website-files.com/64f4e81394e25710d22d042e/...
djoldman
a year ago
These are comparisons to other quantizing methods. That is fine.
What I want to see is comparisons to NON-quantized models all with around the same VRAM along with associated inference latencies.
Also, we would want to see the same quantizing schemes applied to other base models, because perhaps the paper's proposed quantizing scheme only beats the others on a particular base model.
snovv_crash
a year ago
They tested the quantisation on 3 different models.
They also show it has little to no effect relative to fp16 on these models.
IMO that's enough. Comparison against smaller models is much less useful because you can't use the same random seeds. So you end up with a very subjective "this is worse" based purely on the aesthetic preferences of one person vs. another. You already see this with Flux Schnell vs. the larger Flux models.
djoldman
a year ago
I disagree.
They report that their method produces a 6.5 GB model from Flux (22.7 GB). Why wouldn't you want to know how their 6.5 GB model compares to other 6.5 GB models?
Regarding aesthetic preferences: what the appropriate metric for GenAI is remains an open problem... the LLM arena is widely regarded as a good way to measure LLMs, and that's based on user preferences.
In any case, the authors report LPIPS etc. They could do the same for other small models.
snovv_crash
a year ago
LPIPS and similar don't work if the scene is different, as happens if the random seed doesn't match. This is why they can use it to compare the quantised network, but not against networks with reduced numbers of weights.
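To make that concrete, here's roughly how such an LPIPS comparison is run (a minimal sketch using the reference `lpips` package; the random tensors are stand-ins for decoded images from the fp16 and 4-bit models):

    import torch
    import lpips  # pip install lpips

    # LPIPS is a *pairwise* distance: it scores how perceptually different two
    # images of the same scene are. With the same seed and prompt, fp16 vs. 4-bit
    # outputs depict the same scene, so the score isolates quantization damage.
    # A different (smaller) model changes the scene itself, so the number stops
    # meaning "degradation" and starts meaning "different picture".
    loss_fn = lpips.LPIPS(net='alex')

    img_fp16 = torch.rand(1, 3, 256, 256) * 2 - 1   # stand-in, scaled to [-1, 1]
    img_int4 = torch.rand(1, 3, 256, 256) * 2 - 1

    print(loss_fn(img_fp16, img_int4).item())       # lower = more similar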
refulgentis
a year ago
I'm really confused; this looks like concern trolling, because there's a live demo for exactly this kind of A/B testing. IIRC it was near the top of the article, close enough that it was the first link I clicked.
But you're quite persistent that they still need to address this, so it seems more likely that they silently added it after your original post, or that you didn't click through; concern trolling would stay vaguer than this.
Dylan16807
a year ago
The demo is not what they're asking for. It compares original versus quantized. They want quantized versus a similar same-size in GB model.
aaronblohowiak
a year ago
>What I want to see is comparisons to NON-quantized models
isn't that the first image in the diagram / the 22GB model that took 111 seconds?
Dylan16807
a year ago
The next six words you didn't quote make all the difference.
boulos
a year ago
As others have replied, this is reasonable general feedback, but in this specific case the work was done carefully. Table 1 from the linked paper (https://arxiv.org/pdf/2411.05007) includes a variety of metrics, while an entire appendix is dedicated to quality comparisons.
By showing their work side-by-side with other quantization schemes, you can also see a great example of the flavor of different results you can get with these slight tweaks (e.g., ViDiT INT8) and that their quantization does a much better job in reproducing the "original" (Figure 15).
In this application, it isn't strictly necessary to reproduce the original model's outputs exactly, but this work does a pretty good job of it.
djoldman
a year ago
Agreed.
Once a model has been trained, I believe the main metrics people care about are
1. inference speed
2. memory requirements
3. quality of output.
There are usually tradeoffs here. Generally you get a lower memory requirement (a good thing), sometimes faster inference (a good thing), but usually a lower quality of output.
I don't think reproduction of original output is the typical goal.
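For concreteness, the first two are usually measured with something like this (a minimal sketch; `pipe` and `prompt` are placeholders for whatever diffusers-style pipeline you're benchmarking, and nothing here is specific to this paper):

    import time
    import torch

    def benchmark(pipe, prompt: str, n_runs: int = 5):
        """Rough latency + peak-VRAM measurement for a text-to-image pipeline."""
        torch.cuda.reset_peak_memory_stats()
        pipe(prompt)                      # warmup (compilation, caches)

        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            pipe(prompt)
        torch.cuda.synchronize()

        latency = (time.perf_counter() - start) / n_runs
        peak_gb = torch.cuda.max_memory_allocated() / 1e9
        return latency, peak_gb

Quality of output (point 3) has to be measured separately, e.g. with FID or LPIPS as discussed above.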
lostmsu
a year ago
This is a very real concern. I've seen quantized LLMs output complete garbage. In most cases it definitely felt like a smaller unquantized model would have done better. Smaller unquantized models must be included in every comparison.
E.g. compare quantized LLaMA 70B to unquantized LLaMA 8B.
Even better if the test model has a smaller version with similar byte size to the quantized larger one.
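Back-of-the-envelope weight-only sizes (ignoring activations, KV cache, and other overhead) show which pairings are byte-for-byte comparable:

    def model_gb(n_params_billion: float, bits_per_weight: float) -> float:
        """Approximate weight storage only; runtime overheads not included."""
        return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

    print(model_gb(70, 4))     # ~35 GB: 4-bit 70B
    print(model_gb(8, 16))     # ~16 GB: fp16 8B
    print(model_gb(17.5, 16))  # ~35 GB: an fp16 model of ~17B params would
                               #         match the 4-bit 70B byte-for-byte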
superkuh
a year ago
Not really. They quantized the activations here with their inference engine, which decreases compute as well as RAM usage (and required bandwidth). That's a big step.
tbalsam
a year ago
Did you... did you read the technical details? This is almost all they talk about; it's exactly the problem this method was created to get around.
Take a look, it's good stuff! Basically a LoRA to reconstruct outliers lost by quantization, helping keep the performance of the original model.
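In spirit the forward pass is something like this toy sketch (my own illustration, not the paper's fused kernel; in the real thing the low-rank branch is folded into the Nunchaku kernel so it's nearly free):

    import torch

    def two_branch_linear(x, R_q, scale, L1, L2):
        """Toy SVDQuant-style linear layer: a low-bit main branch plus a
        16-bit low-rank branch (shaped exactly like a LoRA) that puts back
        the outlier structure the quantized residual can't carry.
        Shapes: x (batch, in), R_q (out, in), L1 (out, r), L2 (r, in)."""
        main = x @ (R_q.float() * scale).T   # dequantize-and-matmul stand-in
        lora = (x @ L2.T) @ L1.T             # rank-r correction branch
        return main + lora

    # quick shape check with random stand-ins
    y = two_branch_linear(torch.randn(2, 64), torch.randint(-8, 8, (16, 64)),
                          0.05, torch.randn(16, 8), torch.randn(8, 64))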