simonw
9 hours ago
It's notable how much easier it is to get things working now that the embargo has lifted and other projects have shared their integrations.
I'm running vLLM on it now, and it was as simple as:
docker run --gpus all -it --rm \
--ipc=host --ulimit memlock=-1 \
--ulimit stack=67108864 \
nvcr.io/nvidia/vllm:25.09-py3
(That recipe is from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?v...)
And then in the Docker container:
vllm serve &
vllm chat
The default model it loads is Qwen/Qwen3-0.6B, which is tiny and fast to load.
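If you want something other than the default, the standard vLLM CLI (this is just the usual vLLM behavior, nothing Spark-specific) lets you name a model and then hit the OpenAI-compatible API it serves on port 8000:

vllm serve Qwen/Qwen3-0.6B --port 8000 &

# Once it's up, query the OpenAI-compatible endpoint:
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Say hi in one sentence."}]}'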
3abiton
4 hours ago
As someone who got in early on the Ryzen AI 395+, is there any added value to the DGX Spark besides having CUDA (compared to ROCm/Vulkan)? I feel Nvidia fumbled the marketing, making it sound either like an inference miracle or like a dev toolkit (and then not doing enough to differentiate it from the superior AGX Thor).
I'm curious where you find its main value, and how it would fit within your tooling and use cases compared to other hardware.
From the inference benchmarks I've seen, an M3 Ultra always comes out on top.
storus
2 hours ago
The M3 Ultra has a slow GPU and no hardware FP4 support, so its initial prompt processing is going to be slow, practically unusable at 100k+ context sizes. For token generation, which is memory bound, the M3 Ultra would be much faster, but who wants to wait 15 minutes for it to read the context? The Spark will be much faster at initial token processing, giving you a much better time to first token, but then about 3x slower (273 vs 800 GB/s) in token generation throughput. You need to decide which matters more to you. Strix Halo is IMO the worst of both worlds at the moment: it has the weakest specs in both dimensions and the least mature software stack.
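To put rough numbers on that tradeoff (back-of-envelope with my own assumed model size, not measurements from either machine): decode throughput on a memory-bound model is roughly bandwidth divided by bytes read per token, so a ~40 GB quantized model works out to about 800/40 = 20 tok/s on the M3 Ultra versus 273/40 = ~7 tok/s on the Spark, while prefill scales with compute rather than bandwidth, which is exactly where the Spark's GPU and FP4 support pay off.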
justinclift
4 hours ago
It's very likely worth trying ComfyUI on it too: https://github.com/comfyanonymous/ComfyUI
Installation instructions: https://github.com/comfyanonymous/ComfyUI#nvidia
It's a web UI that lets you try a bunch of super powerful things, including easy image and video generation in lots of different ways.
It was really useful to me when benchmarking stuff at work on various gear, e.g. L4 vs A40 vs H100 vs 5th-gen EPYC CPUs.
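For reference, the manual install is short; this is my recollection of the linked README (double-check the NVIDIA section there for the exact PyTorch build you need on Blackwell):

git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip install -r requirements.txt
python main.py

# Then open the UI it serves, by default at http://127.0.0.1:8188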
behnamoh
7 hours ago
I'm curious, does its architecture support all CUDA features out of the box, or is it limited compared to the 5090/6000 Blackwell?
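(One quick sanity check, assuming the usual NVIDIA tooling is present on the box: nvidia-smi can report the compute capability, which is what gates most CUDA features.)

nvidia-smi --query-gpu=name,compute_cap --format=csv

# Blackwell desktop cards report 12.0; if the GB10 reports something like 12.1, most (though not necessarily all) sm_120 features should carry over.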