ben_s
2 months ago
(author of the blog post here)
For me, the hardest part was virtualizing GPUs with NVLink in the mix: the interconnect complicates isolation at exactly the point where you're trying to preserve its performance.
AMA if you want to dig into any of the details.
spwa4
2 months ago
Would it be possible to implement "virtual memory" for a GPU this way? Let's say you have GPUs at 30% compute utilization, but memory-limited. Could you run 2 workloads by offloading each one's GPU memory while it's not in use?
ben_s
2 months ago
Once you oversubscribe GPU memory, performance usually collapses, because every spilled page has to move over PCIe, which is orders of magnitude slower than on-package HBM. Frameworks like vLLM can explicitly offload things like the KV cache to CPU memory, but that's an application-level tradeoff, not transparent GPU virtual memory.
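To make the "explicit offload" point concrete, here's a rough sketch of what that knob looks like in vLLM (from memory, so treat the parameter names as matching recent releases and the model name and sizes as placeholders). swap_space reserves CPU RAM that the scheduler can swap preempted requests' KV cache into, sized up front by the application:

    # Hedged sketch: assumes a recent vLLM release; the model name and
    # sizes below are illustrative placeholders.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may claim
        swap_space=8,                 # GiB of CPU RAM for swapped-out KV cache
    )

    outputs = llm.generate(
        ["Why does oversubscribing GPU memory hurt throughput?"],
        SamplingParams(max_tokens=64),
    )
    print(outputs[0].outputs[0].text)

The design point is that the spill is explicit and bounded, so the scheduler can plan batching around it instead of stalling on page faults it never saw coming.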
checker659
2 months ago
Isn't SR-IOV a thing with these big GPUs? Or is it that you're not concerned with fractional granularity?
ben_s
2 months ago
It's the latter: in this article we're primarily concerned with whole-GPU or multi-GPU partitions that preserve NVLink bandwidth, rather than SR-IOV-style fractional sharing of a single GPU.
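If you want to sanity-check that a multi-GPU partition actually kept its NVLink fabric inside the guest, something like this works (an untested pynvml sketch from memory, using the nvidia-ml-py bindings; calls can raise for link indices a given part doesn't populate, hence the try/except):

    # Hedged sketch: counts how many NVLink links each visible GPU
    # reports as active, as seen from inside the guest.
    import pynvml

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            active = 0
            for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
                try:
                    state = pynvml.nvmlDeviceGetNvLinkState(handle, link)
                    if state == pynvml.NVML_FEATURE_ENABLED:
                        active += 1
                except pynvml.NVMLError:
                    break  # link index not populated on this part
            print(f"GPU {i} ({name}): {active} active NVLink links")
    finally:
        pynvml.nvmlShutdown()

On bare metal an H100 SXM typically reports 18 links per GPU; if a passthrough guest reports zero, the partition lost its fabric.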