DGX-Spark-Finetune-LLM

2 points | posted 11 hours ago by waybarrios

1 comment

waybarrios

11 hours ago

I built a toolkit to fine-tune LLMs using LoRA + native 4-bit quantization on NVIDIA's new Blackwell GPUs (DGX Spark with GB10).

  Key features:
  - NVFP4 (4-bit) via Transformer Engine - fastest option
  - MXFP8 (8-bit) via Transformer Engine for higher precision (first sketch below)
  - bitsandbytes FP4 fallback for any CUDA GPU (second sketch below)
  - ~240MB LoRA adapters instead of ~6GB full model checkpoints
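
  For the Transformer Engine paths, the general shape is wrapping the forward pass in fp8_autocast with a block-scaling recipe. A minimal sketch, assuming a Transformer Engine version that ships the MXFP8BlockScaling recipe (TE >= 2.0); the layer sizes are illustrative and this is not the toolkit's actual code:

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import MXFP8BlockScaling

    # Block-scaled MXFP8 recipe (Blackwell hardware); the NVFP4 path would
    # use the analogous 4-bit recipe where the installed TE provides one.
    recipe = MXFP8BlockScaling()

    # TE's drop-in Linear replacement; 4096 is an illustrative hidden size.
    layer = te.Linear(4096, 4096, bias=True).cuda()
    x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

    # Low-precision GEMMs only run inside the autocast context.
    with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        y = layer(x)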
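
  The fallback path is plain QLoRA-style loading through transformers + peft. A minimal sketch, where the rank, alpha, and target modules are illustrative assumptions rather than the toolkit's defaults:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="fp4",             # bitsandbytes FP4, any CUDA GPU
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "HuggingFaceTB/SmolLM3-3B",            # model from the post
        quantization_config=bnb_config,
        device_map="auto",
    )

    lora_config = LoraConfig(
        r=16,                                  # illustrative rank
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()         # only the adapters train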

  Tested on DGX Spark (128GB unified memory). Training SmolLM3-3B takes ~70GB with NVFP4.
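
  For intuition on the ~240MB figure: LoRA stores two low-rank factors per adapted matrix, so adapter size scales with r*(d_in + d_out) instead of d_in*d_out. A back-of-envelope, with rank and shapes as assumptions:

    # Each adapted weight W (d_out x d_in) gains A (r x d_in) and B (d_out x r).
    def lora_params(d_in: int, d_out: int, r: int) -> int:
        return r * (d_in + d_out)

    # One 2048x2048 projection at rank 64:
    p = lora_params(2048, 2048, 64)            # 262,144 params vs ~4.2M full
    print(f"{p:,} params, {p * 2 / 1e6:.2f} MB at bf16")

  Whether the total lands near 240MB depends on the rank and which modules are targeted.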