I Beat Nvidia NCCL by 2.4x

1 point, posted 6 hours ago
by venkat_2811

2 Comments

venkat_2811

6 hours ago

100% OSS, MIT License. YALI (Yet Another Low-Latency Implementation) achieves 80-85% of speed-of-light software efficiency by using ultra-low-latency primitives for the peer-to-peer all_reduce_sum communication collective, a critical operation in multi-GPU LLM training and inference.

venkat_2811

6 hours ago

Wisdom from CPU land translates well to GPUs. Static scheduling, prefetching, 3-stage double-buffering, pre-allocation, and careful memory ordering in a custom CUDA kernel help it outperform NVIDIA NCCL. Experimental integration in vllm.rs shows ~20% prefill (TTFT) and ~10% decode (TPOT) latency improvements.