Nvidia greenboost: transparently extend GPU VRAM using system RAM/NVMe

222 points, posted 4 days ago

42 Comments

0xbadcafebee

6 hours ago

You can already do this with some GPU drivers:

  GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdttm.pages_limit=5242880 ttm.pages_limit=5242880"
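For reference, a rough sketch of how much memory that limit corresponds to. It assumes the value is a page count and that pages are 4 KiB (typical on x86-64 Linux, but worth checking with `getconf PAGESIZE`); the helper name is made up for illustration.

```python
# Assumption: 4 KiB pages. Other architectures or configs may differ.
PAGE_SIZE_BYTES = 4096


def pages_to_gib(pages: int, page_size: int = PAGE_SIZE_BYTES) -> float:
    """Convert a page count (as passed to ttm.pages_limit) to GiB."""
    return pages * page_size / 2**30


# The value from the kernel command line above.
print(pages_to_gib(5242880))  # 20.0
```

So under that assumption the setting lets the GPU driver map up to about 20 GiB of system RAM.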

One downside is your kernel isn't going to reserve that memory away from userland. You will still see all the memory at system level as "free". As the GPU driver starts using it, other apps/the OS will try to use the "free" memory, not knowing how much of it is in use (it may show up as "cache", or not at all). Then OOM killer starts going or programs start crashing, and at some point the OS tips over or GPU driver crashes. You can add loads of swap as a compromise and it works okay, if a bit slow.

In any case, loading a gigantic model just to use system RAM is absurdly slow (due to mem bandwidth), like 1-5 t/s, so it's not practical. It'd take a whole day to process one 86k token request. Just pay a cloud provider $0.01 to do it in 10 seconds.
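A quick back-of-the-envelope check of the "whole day" figure, as a sketch. It assumes every one of the 86k tokens is handled at the quoted 1-5 t/s rate, which ignores any difference between prompt processing and generation speed; the function name is illustrative only.

```python
def hours_to_process(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock hours to get through `tokens` at a constant rate."""
    return tokens / tokens_per_second / 3600


# The two ends of the 1-5 t/s range quoted above.
for rate in (1, 5):
    print(f"{rate} t/s: {hours_to_process(86_000, rate):.1f} h")
# 1 t/s: 23.9 h
# 5 t/s: 4.8 h
```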

jmward01

4 hours ago

The point is not how fast it is now. The point is that this opens new possibilities that can be built on. Potentially models that are trained with slightly different architectures to optimize for this use case. Possibly others come to improve this path. Possibly HW manufacturers make a few small adjustments that remove bottlenecks. Who knows, the next person may combine CPU compute with this mem sharing to get another token a second. Then the next person does predictive loading into memory to keep that bandwidth 100% maxed and usable. Then the next does and the next does. Before you know it there is a real thing there that never existed.

This is a great project. I love the possibilities it hints at. Thanks for building it!

smallnamespace

3 hours ago

It’s architecturally not a good approach. System RAM is much slower so you should put data that doesn’t need to be used often on it. That knowledge is at the application layer. Adding a CUDA shim makes system RAM appear like VRAM, which gets things to run, but it will never run very well.

The benchmarks at the bottom mention memory tiering and manually controlling where things go, but if your application already does that, then you probably don’t also need a CUDA shim. The application should control the VRAM to system memory transfers with boring normal code.

timnetworks

an hour ago

Some people are not concerned with having it run the fastest, just having it run at all may be enough.

nl

5 hours ago

This is really interesting engineering, but I agree with the other commentators that the benchmarking makes it hard to understand the contribution various factors are having.

The ExLlamaV3 EXL3 2bpw (8 GB, full VRAM) row is an order of magnitude faster than the baseline - but the baseline seems to be the 32GB model running with the KV cache shared to system memory only (I think?)

But if an 8GB model gives sufficient quality then it seems like that would have worked without the shared memory thing?

I think the useful apples-to-apples benchmark is currently the Ollama + GreenBoost shim (baseline) (2-5 tps) vs ExLlamaV3 + GreenBoost cache (8–20 tps) comparison.

It would be really useful to see this compared with the existing llama CPU/memory offload. There is a note at the start ("Offload layers to CPU — works, but drops token/s by 5–10× because CPU RAM has no CUDA coherence") - but it is unclear if that 5-10x token speed drop is compared to running a model completely in GPU or compared to the greenboost approach.

I think it is vs GPU, in which case it seems likely the performance is similar to what greenboost is giving but probably much more stable.

kristianp

3 hours ago

ExLlamaV3 EXL3 2bpw is likely the 30b parameter GLM 4.7 Flash quantised down to 2 bits. The unstated assumption is that you need to check the 2bpw quantisation works well enough for your use case.

The reported size of the ModelOpt FP8, 16 GB, sounds wrong to me. If it's 8 bits per parameter it is going to be a similar size to the glm-4.7-flash:q8_0. They repeat this a few times in the readme.
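The size argument can be sketched as simple arithmetic: weight bytes are roughly parameters times bits per weight divided by 8. The 30B figure comes from the comment above; the 8.5 bits per weight for q8_0 is an assumption (8-bit values plus a per-block scale), and the estimate ignores metadata and any tensors kept at higher precision. Function names are illustrative.

```python
def model_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal GB."""
    return params * bits_per_weight / 8 / 1e9


def implied_bits_per_weight(size_gb: float, params: float) -> float:
    """Bits per weight that a given file size would imply."""
    return size_gb * 1e9 * 8 / params


PARAMS = 30e9  # "30b parameter" figure from the comment above

print(model_size_gb(PARAMS, 2))    # 7.5, close to the "8 GB" EXL3 2bpw row
print(model_size_gb(PARAMS, 8))    # 30.0
print(model_size_gb(PARAMS, 8.5))  # 31.875, close to the 31.8 GB q8_0 figure
print(round(implied_bits_per_weight(16, PARAMS), 2))  # 4.27
```

On these assumptions a 16 GB file would correspond to a little over 4 bits per weight, not 8, which is the inconsistency being pointed out.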

Havoc

5 hours ago

> The best strategy is to shrink the model until it fits — either with EXL3 quantization or ModelOpt PTQ — and use GreenBoost's DDR4 pool for KV cache only.

Does this make sense? I'd have thought the KV is guaranteed to be used 100% of the time while say in a MoE the same can't be said of the weights.

Though I suppose if you're shooting for huge context then having that allocation go into RAM makes sense, especially when it's allocated but not used yet

alexeldeib

2 hours ago

KV cache is, well, a cache that can fill up and trigger eviction. You require enough space to execute at least 1 fwd pass of 1 request at your context length. KV cache hits reduce TTFT by avoiding prefill. You don’t get to skip decode.

MoE is kinda related in terms of lower usage requirements vs a dense model of same total param size, but I think your mental model is a bit off.
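To make the "enough space for one request at your context length" point concrete, here is a minimal sketch of the usual KV cache size estimate for standard multi-head or grouped-query attention. All model dimensions below are hypothetical placeholders, not the layout of GLM 4.7 Flash or any other specific model, and models with other attention schemes will not follow this formula.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # assumption: fp16/bf16 cache entries
) -> int:
    """Bytes needed to hold keys and values for one sequence."""
    # Factor of 2 covers one key vector and one value vector per head.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


# Hypothetical model: 48 layers, 8 KV heads of dimension 128, 32k context.
size = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, context_tokens=32768)
print(size / 2**30)  # 6.0 (GiB)
```

The cache grows linearly with context length, which is why a long-context request can need several GiB even when the weights already fit in VRAM.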

daneel_w

6 hours ago

Related, a couple of years ago: https://old.reddit.com/r/Amd/comments/15t0lsm/i_turned_a_95_...

"I turned a $95 AMD APU into a 16GB VRAM GPU and it can run stable diffusion!"

3abiton

6 hours ago

> it can generate a 50 steps 512x512 image around 1 minute and 50 seconds.

I have the 4650G APU, and the best way to describe it is: lacking in support. This was even more true 3 years ago than now. rocm (is) was absolutely dogshit then, I know this because I tried to do the same when that post was made. You have to compile everything from scratch, get the relevant patches, and even then, xformers, which is a library that accelerates diffusion model inferencing, was not supported for renoir or rocm back then. Yes, you can generate an image, but it was much slower, and riddled with bugs. You couldn't update rocm because it broke compatibility, and it was partly the reason I got into nixos. That being said, those APUs are a powerhouse. Nowadays I can run decent agentic workflows on them (I have 64gb of ddr4 ram, ie APU can suck as much as it needs with the latest linux kernels).

Just note, diffusion models are still second class citizens on AMD APUs and even GPUs. But then again, nothing comes close right now on the market except for what Apple offers.

nl

5 hours ago

The Ryzen AI CPU/GPUs (Ryzen AI 395+ etc) seem to have increasing support - https://lemonade-server.ai/ now has support for the NPU as well as the combined CPU/GPU (which I guess is an APU but is different to the G series of APUs I think?)

But I'm always interested in first hand experiences of how good is it really - I'm pretty cynical about the idea that AMD actually knows what it takes to build good software end-to-end.

3abiton

5 hours ago

I also have one, and indeed support is very much frictionless now compared to a year ago. But again, not thanks to AMD, as initially it was purely community driven. Strix halo was not even supported by ROCm (officially), and we had to deal with therock images, then donato made the toolbox, and then lemonade came through. I am really surprised how AMD approached this. They made big promises, they threw the hardware out, it really is an amazing piece of hardware given what you can do with it, but it was left hanging without support for the AI stack for months even though it had it in its name. Contrast that with the DGX spark (yes it had and still has bugs in its kernels, but cuda worked on day 1) and you can see the difference. Nvidia is selling an ecosystem, AMD is selling hardware. I really hope AMD focuses on the software layer more.

nl

29 minutes ago

I believe Lemonade is the AMD team right?

But yes I agree with you about their lack of prioritization for software!

ma2kx

5 hours ago

The physical bottleneck to system memory remains. Therefore, I assume that better results are achieved by manually adjusting which layers are offloaded.

I would prefer to use system memory to cache different models, focusing on things like embedding, rerankers, and TTS. This is sufficient to run a more complex RAG locally, for example, via Mem0, and then use a larger LLM via the cloud.

yjtpesesu2

6 hours ago

How does this differ from anything llama.cpp offers, regarding offloading layers? The repo consistently refers to "DDR4". Is there a reason DDR5 won't work with this?

svnt

6 hours ago

The readme opens with this:

> I have an RTX 5070 with 12 GB VRAM and I wanted to run glm-4.7-flash:q8_0, which is a 31.8 GB model. The standard options are:

> Offload layers to CPU — works, but drops token/s by 5–10× because CPU RAM has no CUDA coherence. You end up waiting. Use a smaller quantization — you lose quality. At q4_0 the model is noticeably worse on reasoning tasks.

> Buy a bigger GPU — not realistic for consumer hardware. A 48 GB card costs more than a complete workstation.

> None of those felt right, so I built an alternative: route the overflow memory to DDR4 via DMA-BUF, which gives the GPU direct access to system RAM over PCIe 4.0 without a CPU copy involved.

And then limps home with this caveat on the closest thing to a benchmark:

> The PCIe 4.0 link (~32 GB/s) is the bottleneck when the model overflows VRAM. The best strategy is to shrink the model until it fits — either with EXL3 quantization or ModelOpt PTQ — and use GreenBoost's DDR4 pool for KV cache only.

I think the reason it refers to DDR4 is because that is how the user explained it to their coding agent. LLMs are great at perpetuating unnecessary specificity.
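A rough sketch of why the PCIe link dominates, using the readme's own numbers (31.8 GB model, 12 GB VRAM, ~32 GB/s). It assumes a dense model where every overflowed weight byte crosses the link once per generated token, that all VRAM holds weights, and that compute time is negligible. A mixture-of-experts model that touches only some weights per token would do better than this bound; the function name is illustrative.

```python
def max_tokens_per_second(model_gb: float, vram_gb: float, link_gb_per_s: float) -> float:
    """Upper bound on decode speed when overflow weights stream over the link."""
    overflow_gb = max(model_gb - vram_gb, 0.0)
    if overflow_gb == 0:
        return float("inf")  # everything fits in VRAM, link is not the limit
    return link_gb_per_s / overflow_gb


print(round(max_tokens_per_second(31.8, 12, 32), 2))  # 1.62
```

Under those assumptions the ceiling is under 2 t/s, which is in the same range as the low single-digit figures discussed elsewhere in the thread.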

kcb

6 hours ago

CUDA has had managed memory that pages between VRAM and system RAM for a decade. Problem is doing so is unusably slow for AI purposes. Seems like an unnecessary layer here.

xienze

6 hours ago

Presumably it means that software doesn’t have to write the same sort of layer offloading support. It’ll “just work” as if you had X GB of VRAM all along.

yjtpesesu2

6 hours ago

so, magic?

armada651

3 hours ago

Doesn't Windows already do this by default? I can already run models bigger than my GPU VRAM and it will start using up to 50% of my system RAM as "shared memory". This is on a Desktop PC without a shared memory architecture.

Yokohiii

2 hours ago

The nvidia windows driver enables RAM swapping by default.

Great way to backstab you if you prefer inference speed.

3836293648

2 hours ago

I don't think Windows does this, but Ollama does

nodja

2 hours ago

NVIDIA's GPU drivers on windows 100% do this

https://i.imgur.com/c0a3vUy.png

paultendo

6 hours ago

Could be a very useful way to do some overnight tasks using spare RAM. Possibly things like LLM-based categorisation, labelling, data cleansing. That's what comes to mind for me anyway.

NooneAtAll3

13 minutes ago

nvidia failed to provide GPUs with an actually meaningful amount of VRAM

and instead of improving the actual product, it decided to "solve the problem in software"

I expect this greenboost to crash and burn, honestly...

cma

9 minutes ago

> it decided to "solve the problem in software"

This isn't made by nvidia

Insanity

5 hours ago

Extend your VRAM using RAM, then extend your RAM using Swap.

system2

2 hours ago

And burn the swap pagesys file to a rewritable DVD to complete the cycle. It will be super fast that way.

SV_BubbleTime

3 hours ago

If you are doing video models, this is an excellent way to murder your SSD.

Do not put swap on an SSD you care about at all.

rvz

20 minutes ago

> Do not put swap on an SSD you care about at all.

This.

Many people are rediscovering what the purpose of swap files is, but will still find a way to abuse it without knowing that they are actually destroying their SSD.

Insanity

2 hours ago

I was writing it somewhat tongue-in-cheek and not as a serious suggestion. But thanks for adding the disclaimer, that's good advice!

yjftsjthsd-h

7 hours ago

Previously: https://news.ycombinator.com/item?id=47384557

(Still cool, still would benefit from better benchmarks)

I really appreciate thriftful & resourceful points of view. Exploring what if, looking for use is such a great virtue.

bigwheels

7 hours ago

Can you elaborate beyond the shallow/superficial dismissal?