MetaXuda – 1.1 TOPS GPU Runtime for Apple Silicon ML (Rust and Metal)

Hey HN! I built MetaXuda after getting tired of "buy Windows for ML" advice when working on Apple Silicon.

Problem: GPU acceleration in mainstream ML libraries (XGBoost, the scikit-learn ecosystem) is effectively CUDA-only, with no macOS GPU path. Existing translation layers such as ZLUDA add overhead.

Solution: a native Rust + Metal runtime, built from scratch.

Key features:
- 1.1 TOPS throughput (95% of the M1's theoretical peak)
- Tokio async scheduler with zero race conditions
- Multi-tier memory, GPU → RAM → SSD, handling 100GB+ workloads (sketched below)
- 230+ GPU ops (math, transforms, ML primitives)
- CUDA-style APIs for easy library integration
- Bypasses Numba's execution path
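
As a sketch of that multi-tier fallback: try the GPU budget first, spill to host RAM, then to an SSD-backed file. Every type, field, budget, and file name below is a hypothetical stand-in, not MetaXuda's actual allocator.

    use std::fs::File;
    use std::io::{self, Write};

    /// Stand-ins for the three tiers; a real runtime would hold a Metal
    /// buffer handle in the Gpu arm instead of a host Vec.
    enum Tier {
        Gpu(Vec<u8>),
        Ram(Vec<u8>),
        Ssd(File),
    }

    struct TieredAlloc {
        gpu_budget: usize, // bytes still free on the GPU tier
        ram_budget: usize, // bytes still free in host RAM
    }

    impl TieredAlloc {
        /// Fastest tier first, spilling downward under pressure.
        fn alloc(&mut self, len: usize) -> io::Result<Tier> {
            if len <= self.gpu_budget {
                self.gpu_budget -= len;
                Ok(Tier::Gpu(vec![0u8; len]))
            } else if len <= self.ram_budget {
                self.ram_budget -= len;
                Ok(Tier::Ram(vec![0u8; len]))
            } else {
                // Slowest tier: spill to disk so workloads larger than
                // physical memory can still make progress.
                let mut f = File::create("spill.bin")?;
                f.write_all(&vec![0u8; len])?;
                Ok(Tier::Ssd(f))
            }
        }
    }

    fn main() -> io::Result<()> {
        let mut alloc = TieredAlloc { gpu_budget: 1 << 20, ram_budget: 4 << 20 };
        for &len in &[512 * 1024, 1 << 20, 8 << 20] {
            let placed = match alloc.alloc(len)? {
                Tier::Gpu(_) => "GPU",
                Tier::Ram(_) => "RAM",
                Tier::Ssd(_) => "SSD",
            };
            println!("{len} bytes placed on {placed}");
        }
        Ok(())
    }

The fall-through ordering is the core idea; the real buffers and eviction policy would live behind the same interface.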

Technical approach:
- No CUDA/ZLUDA code reuse (licensing and performance reasons)
- PyO3 wrapper for Python
- Arrow-based quantization performed in-kernel
- 93.37% GPU utilization cap so macOS itself is never starved (see the sketch after this list)
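
A rough sketch of how a Tokio scheduler can enforce that kind of cap, using a semaphore sized below the full queue depth. The slot count, the 0.9337 factor as applied here, and submit_kernel are illustrative assumptions, not the real scheduler.

    // Cargo.toml (assumed): tokio = { version = "1", features = ["full"] }
    use std::sync::Arc;
    use tokio::sync::Semaphore;

    const TOTAL_SLOTS: usize = 64; // pretend command-queue depth
    const GPU_CAP: f64 = 0.9337;   // leave headroom so macOS keeps GPU time

    // Stand-in for encoding and committing a Metal command buffer.
    async fn submit_kernel(id: usize) {
        tokio::time::sleep(std::time::Duration::from_millis(5)).await;
        println!("kernel {id} done");
    }

    #[tokio::main]
    async fn main() {
        // Admit new work only while occupancy stays under the cap.
        let permits = (TOTAL_SLOTS as f64 * GPU_CAP) as usize;
        let gate = Arc::new(Semaphore::new(permits));

        let mut tasks = Vec::new();
        for id in 0..128 {
            let gate = Arc::clone(&gate);
            tasks.push(tokio::spawn(async move {
                // Permit is released when `_permit` drops after the kernel.
                let _permit = gate.acquire().await.expect("semaphore closed");
                submit_kernel(id).await;
            }));
        }
        for t in tasks {
            t.await.expect("task panicked");
        }
    }

A single admission gate like this also gives the scheduler one place to make the stability-vs-speed tradeoff mentioned under limitations.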

Known limitations:
- Metal stream limits are still undocumented by Apple
- CUDA API coverage is incomplete (in progress)
- Some blocking calls favor stability over raw speed

pip install metaxuda
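
For context on how a pip-installable package can front a Rust core, here is a minimal PyO3 binding sketch. The exported vector_add is a made-up placeholder (a CPU loop standing in for a Metal dispatch), not MetaXuda's actual API; a wheel like this would typically be built with maturin.

    // Cargo.toml (assumed): pyo3 = { version = "0.22", features = ["extension-module"] }
    use pyo3::exceptions::PyValueError;
    use pyo3::prelude::*;

    /// Placeholder op: a CPU loop standing in for a GPU kernel dispatch.
    #[pyfunction]
    fn vector_add(a: Vec<f32>, b: Vec<f32>) -> PyResult<Vec<f32>> {
        if a.len() != b.len() {
            return Err(PyValueError::new_err("length mismatch"));
        }
        Ok(a.iter().zip(&b).map(|(x, y)| x + y).collect())
    }

    /// Module name matches `import metaxuda` after the pip install above.
    #[pymodule]
    fn metaxuda(m: &Bound<'_, PyModule>) -> PyResult<()> {
        m.add_function(wrap_pyfunction!(vector_add, m)?)?;
        Ok(())
    }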

Open to questions on Metal vs CUDA architecture, Rust async patterns, or Apple GPU quirks. Also looking for feedback on scheduler design.

License inquiries: p.perinban@gmail.com