djoldman
a year ago
This is one in a long line of posts saying "we took a model and made it smaller" so that it can now run under lower hardware requirements.
It is important to keep in mind that modifying a model changes the performance of the resulting model, where performance is "correctness" or "quality" of output.
Just because the base model is very performant does not mean the smaller model is.
This means that another model that is the same size as the new quantized model may outperform the quantized model.
Suppose there are equal sized big models A and B with their smaller quantized variants a and b. A being a more performant model than B does not guarantee a being more performant than b.
ttul
a year ago
While I think I agree that there are many posts here on HackerNews announcing new model compression techniques, your characterization above understates the technical innovations and practical impact described in this MIT paper.
Unlike traditional model compression work that simply applies existing techniques, SVDQuant synthesizes several ideas in a comprehensive new approach to model quantization:
- Developing an outlier absorption mechanism using low-rank decomposition — this aspect alone seems quite novel, although the math is admittedly way beyond my level (rough sketch after this list)
- Combining SVD with smoothing in a way that specifically addresses the unique challenges of diffusion models
- Creating an innovative kernel fusion technique (they call it “Nunchaku”) that makes the theoretical benefits practically realizable, because without this, the extra computation required to implement the above steps would simply slow the model back down to baseline
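If I'm following the gist, the low-rank part is something like this toy sketch (definitely not the authors' actual code: the function name `split_weight`, the `rank` and `n_bits` defaults, and the naive per-tensor scale are all my own stand-ins, and I've left out the smoothing step entirely):

    import torch

    def split_weight(W: torch.Tensor, rank: int = 32, n_bits: int = 4):
        """Toy low-rank-plus-quantized split (names are mine, not the paper's)."""
        # 1. Keep the top singular directions in 16-bit: they absorb the
        #    large-magnitude structure ("outliers") that 4-bit can't represent.
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        L1 = U[:, :rank] * S[:rank]     # (out_features, rank)
        L2 = Vh[:rank, :]               # (rank, in_features)

        # 2. Quantize only the residual, which now has a much smaller range.
        R = W - L1 @ L2
        qmax = 2 ** (n_bits - 1) - 1
        scale = R.abs().max() / qmax
        R_q = torch.clamp((R / scale).round(), -qmax - 1, qmax).to(torch.int8)

        return L1, L2, R_q, scale

    # e.g. L1, L2, R_q, scale = split_weight(torch.randn(4096, 4096))

The low-rank branch stays in 16-bit, which is why the fused kernel below matters: without it you'd pay for an extra matmul per layer.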
This isn't just incremental improvement - the paper achieves several breakthrough results:
- First successful 4-bit quantization of both weights AND activations for diffusion models
- 3.5x memory reduction for 12B parameter models while maintaining image quality
- 3.0x speedup over existing 4-bit weight-only quantization approaches
- Enables running 12B parameter models on consumer GPUs that previously couldn't handle them
And, I’ll add, as someone who has been following the diffusion space quite actively for the last two years, the amount of creativity that can be unleashed when models are accessible to people with consumer GPUs is nothing short of astonishing.
The authors took pains to validate their approach by testing it against three models (Flux, PixArt-Sigma, and SDXL) and along several quality-comparison axes (FID score, Image Reward, LPIPS, and PSNR). They also did a proper ablation study to see the contribution of each component in their approach to image quality.
What particularly excites me about this paper is not the ability to run a model that eats 22GB of VRAM in just 7GB. The exciting thing is the prospect of running a 60GB model in 20GB of VRAM. I’m not sure whether anyone has or is planning to train such a monster, but I suspect that Midjourney, OpenAI, and Google all have significantly larger models running in their infrastructure than what can be run on consumer hardware. The more dimensions you can throw at image and video generation, the better things get.
djoldman
a year ago
I definitely agree that there may be some interesting advancements here.
I am trying to call attention to the models used for evaluation comparison. There are 3 factors: inference speed/latency, model size in total loaded VRAM, and model performance in terms of output.
Comparisons should address all of these considerations, otherwise it's easy to hide deficiencies.
Jackson__
a year ago
The site literally has a quick visual comparison near the top, which shows that theirs is the closest to 16-bit performance compared to the others. I don't get what more you'd want.
https://cdn.prod.website-files.com/64f4e81394e25710d22d042e/...
djoldman
a year ago
These are comparisons to other quantizing methods. That is fine.
What I want to see is comparisons to NON-quantized models all with around the same VRAM along with associated inference latencies.
Also, we would want to see the same quantizing schemes applied to other base models, because perhaps the paper's proposed quantizing scheme only beats the others on a particular base model.
snovv_crash
a year ago
They tested the quantisation on 3 different models.
They also show it has little to no effect relative to fp16 on these models.
IMO that's enough. Comparison against smaller models is much less useful because you can't use the same random seeds. So you end up with a very subjective "this is worse" based purely on the aesthetic preferences of one person vs. another. You already see this with Flux Schnell vs. the larger Flux models.
djoldman
a year ago
I disagree.
They report that their method produces a 6.5 GB model from Flux (22.7 GB). Why wouldn't you want to know how their 6.5 GB model compares to other 6.5 GB models?
Regarding aesthetic preferences: what the appropriate metric for GenAI is remains an open problem... the LLM arena is widely regarded as a good way to measure LLMs, and that's based on user preferences.
In any case, the authors report LPIPS etc. They could do the same for other small models.
snovv_crash
a year ago
LPIPS and similar don't work if the scene is different, as happens if the random seed doesn't match. This is why they can use it to compare the quantised network, but not against networks with reduced numbers of weights.
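To make that concrete, here's roughly how such an LPIPS comparison is run (a minimal sketch using the reference `lpips` package; the random tensors are stand-ins for decoded images from the fp16 and 4-bit models):

    import torch
    import lpips  # pip install lpips

    # LPIPS is a *pairwise* distance: it scores how perceptually different two
    # images of the same scene are. With the same seed and prompt, fp16 vs. 4-bit
    # outputs depict the same scene, so the score isolates quantization damage.
    # A different (smaller) model changes the scene itself, so the number stops
    # meaning "degradation" and starts meaning "different picture".
    loss_fn = lpips.LPIPS(net='alex')

    img_fp16 = torch.rand(1, 3, 256, 256) * 2 - 1   # stand-in, scaled to [-1, 1]
    img_int4 = torch.rand(1, 3, 256, 256) * 2 - 1

    print(loss_fn(img_fp16, img_int4).item())       # lower = more similar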
refulgentis
a year ago
I'm really confused; this looks like concern trolling, because there's a live demo for exactly this kind of A/B testing. IIRC it was near the top of the article, close enough that it was the first link I clicked.
But you're quite persistent that they still need to address this, so it seems more likely that they silently added it after your original post, or that you didn't click through; concern trolling would stay vaguer than this.
Dylan16807
a year ago
The demo is not what they're asking for. It compares original versus quantized. They want quantized versus a similar same-size in GB model.
aaronblohowiak
a year ago
>What I want to see is comparisons to NON-quantized models
isn't that the first image in the diagram / the 22GB model that took 111 seconds?
Dylan16807
a year ago
The next six words you didn't quote make all the difference.
boulos
a year ago
As others have replied, this is reasonable general feedback, but in this specific case the work was done carefully. Table 1 from the linked paper (https://arxiv.org/pdf/2411.05007) includes a variety of metrics, while an entire appendix is dedicated to quality comparisons.
By showing their work side-by-side with other quantization schemes, you can also see a great example of the flavor of different results you can get with these slight tweaks (e.g., ViDiT INT8) and that their quantization does a much better job in reproducing the "original" (Figure 15).
In this application, it isn't strictly necessary to reproduce the original model's outputs exactly, but this work does a pretty good job of it.
djoldman
a year ago
Agreed.
Once a model has been trained, I believe the main metrics people care about are
1. inference speed
2. memory requirements
3. quality of output.
There are usually tradeoffs here. Generally you get a lower memory requirement (a good thing), sometimes faster inference (a good thing), but usually a lower quality of output.
I don't think reproduction of original output is the typical goal.
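For concreteness, the first two are usually measured with something like this (a minimal sketch; `pipe` and `prompt` are placeholders for whatever diffusers-style pipeline you're benchmarking, and nothing here is specific to this paper):

    import time
    import torch

    def benchmark(pipe, prompt: str, n_runs: int = 5):
        """Rough latency + peak-VRAM measurement for a text-to-image pipeline."""
        torch.cuda.reset_peak_memory_stats()
        pipe(prompt)                      # warmup (compilation, caches)

        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            pipe(prompt)
        torch.cuda.synchronize()

        latency = (time.perf_counter() - start) / n_runs
        peak_gb = torch.cuda.max_memory_allocated() / 1e9
        return latency, peak_gb

Quality of output (point 3) has to be measured separately, e.g. with FID or LPIPS as discussed above.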
lostmsu
a year ago
This is a very real concern. I've seen quantized LLMs output complete garbage. In most cases it definitely felt like a smaller unquantized model would have done better. Smaller unquantized models must be included in every comparison.
E.g. compare quantized LLaMA 70B to unquantized LLaMA 8B.
Even better if the test model has a smaller version with similar byte size to the quantized larger one.
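Back-of-the-envelope weight-only sizes (ignoring activations, KV cache, and other overhead) show which pairings are byte-for-byte comparable:

    def model_gb(n_params_billion: float, bits_per_weight: float) -> float:
        """Approximate weight storage only; runtime overheads not included."""
        return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

    print(model_gb(70, 4))     # ~35 GB: 4-bit 70B
    print(model_gb(8, 16))     # ~16 GB: fp16 8B
    print(model_gb(17.5, 16))  # ~35 GB: an fp16 model of ~17B params would
                               #         match the 4-bit 70B byte-for-byte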
superkuh
a year ago
Not really. They quantized the activations here with their inference engine, which decreases compute as well as RAM usage (and required bandwidth). That's a big step.
tbalsam
a year ago
Did you... did you read the technical details? This is almost all they talk about; it's exactly the problem this method was created to get around.
Take a look, it's good stuff! Basically a LoRA to reconstruct outliers lost by quantization, helping keep the performance of the original model.
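In spirit the forward pass is something like this toy sketch (my own illustration, not the paper's fused kernel; in the real thing the low-rank branch is folded into the Nunchaku kernel so it's nearly free):

    import torch

    def two_branch_linear(x, R_q, scale, L1, L2):
        """Toy SVDQuant-style linear layer: a low-bit main branch plus a
        16-bit low-rank branch (shaped exactly like a LoRA) that puts back
        the outlier structure the quantized residual can't carry.
        Shapes: x (batch, in), R_q (out, in), L1 (out, r), L2 (r, in)."""
        main = x @ (R_q.float() * scale).T   # dequantize-and-matmul stand-in
        lora = (x @ L2.T) @ L1.T             # rank-r correction branch
        return main + lora

    # quick shape check with random stand-ins
    y = two_branch_linear(torch.randn(2, 64), torch.randint(-8, 8, (16, 64)),
                          0.05, torch.randn(16, 8), torch.randn(8, 64))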