Show HN: A 6.9B MoE LLM in Rust, Go, and Python

3 points, posted 11 hours ago
by fumi2026

1 comment

fumi2026

11 hours ago

Hi HN, author here.

I built this project because I wanted to understand the low-level mechanics of LLMs and how FFI overhead differs between languages.

Some key takeaways:

Architecture: It's a 6.9B MoE model implemented purely in Rust, Go, and Python.

Shared CUDA: All three languages bind to the exact same CUDA kernels (no PyTorch/TensorFlow). A minimal Rust-side binding sketch follows this list.

Performance: I was surprised by how Go's per-call cgo overhead compares to Rust's FFI in this specific workload (there's a small timing sketch after the list).
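
To make the shared-kernel point concrete, here's roughly what the Rust side of such a binding could look like. The library name (kernels), the symbol (moe_gemm_f16), and its signature are illustrative guesses, not the project's actual exports:

    // Hypothetical C-ABI launcher for a CUDA kernel; the symbol and
    // signature are assumptions, since the real exports aren't shown here.
    use std::ffi::c_void;

    #[link(name = "kernels")]
    extern "C" {
        // Runs one expert GEMM on device pointers; returns 0 on success.
        fn moe_gemm_f16(
            a: *const c_void,  // [m, k] activations (device memory)
            b: *const c_void,  // [k, n] expert weights (device memory)
            out: *mut c_void,  // [m, n] output (device memory)
            m: i32,
            n: i32,
            k: i32,
        ) -> i32;
    }

    /// Thin wrapper; the caller must pass valid device pointers of the
    /// stated shapes.
    pub unsafe fn gemm_f16(
        a: *const c_void, b: *const c_void, out: *mut c_void,
        m: i32, n: i32, k: i32,
    ) -> Result<(), i32> {
        // SAFETY: upheld by the caller per the doc comment above.
        let rc = unsafe { moe_gemm_f16(a, b, out, m, n, k) };
        if rc == 0 { Ok(()) } else { Err(rc) }
    }

Because the kernels sit behind a plain C ABI, Go can reach the same symbols through cgo and Python through ctypes/cffi, which is what keeps the three implementations comparable.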
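
On the overhead question: Rust extern "C" calls compile down to ordinary function calls, while cgo has to switch stacks and coordinate with the Go scheduler, which is the usual source of the gap people measure. A rough way to see the per-call boundary cost from Rust, assuming the shared library also exports a no-op (ffi_noop is hypothetical):

    use std::time::Instant;

    #[link(name = "kernels")]
    extern "C" {
        // A do-nothing export used purely to time the language boundary;
        // hypothetical symbol, not from the actual project.
        fn ffi_noop();
    }

    fn main() {
        const N: u32 = 10_000_000;
        let start = Instant::now();
        for _ in 0..N {
            // The external call can't be optimized away, so this times
            // N round trips across the FFI boundary.
            unsafe { ffi_noop() };
        }
        let ns = start.elapsed().as_nanos() as f64 / N as f64;
        println!("{ns:.1} ns per call");
    }

The Go equivalent is the same loop around a cgo call; comparing the two per-call numbers isolates boundary cost from kernel time.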

I know it's reinventing the wheel, but it was a great way to learn. Happy to answer any questions about the implementation or the FFI architecture!