Why Rust won't make your embedding model inference fast

1 point | posted 8 hours ago
by fm1320

3 Comments

storystarling

7 hours ago

Fair point on HBM, but in a production system the orchestration layer is often the actual bottleneck. I've found that keeping the GPU fed requires a level of concurrency and stability that is hard to tune in Python. Rust is useful here not for the inference itself, but for ensuring the request pipeline doesn't choke while the GPU is waiting for data.
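To make that concrete, here is a minimal sketch of the pattern described above, not anyone's production code: requests funnel through an async channel into a batcher that groups them by size or deadline, so the GPU-side worker never starves. It assumes tokio (with the sync, time, macros, and rt features), and MAX_BATCH, FLUSH_AFTER, and run_inference are all illustrative placeholders.

    use std::time::Duration;
    use tokio::sync::{mpsc, oneshot};
    use tokio::time::timeout;

    const MAX_BATCH: usize = 32;
    const FLUSH_AFTER: Duration = Duration::from_millis(5);

    struct Request {
        text: String,
        // Each request carries a oneshot sender so the caller can await its result.
        respond: oneshot::Sender<Vec<f32>>,
    }

    // Stand-in for the actual GPU call (e.g. an FFI call into the inference runtime).
    async fn run_inference(batch: &[Request]) -> Vec<Vec<f32>> {
        batch.iter().map(|_| vec![0.0; 384]).collect()
    }

    async fn batcher(mut rx: mpsc::Receiver<Request>) {
        loop {
            // Block until at least one request arrives.
            let Some(first) = rx.recv().await else { return };
            let mut batch = vec![first];

            // Keep filling the batch until it is full or no new request shows up
            // within FLUSH_AFTER, trading a few ms of latency for utilization.
            while batch.len() < MAX_BATCH {
                match timeout(FLUSH_AFTER, rx.recv()).await {
                    Ok(Some(req)) => batch.push(req),
                    _ => break, // timer fired or channel closed: flush now
                }
            }

            let embeddings = run_inference(&batch).await;
            for (req, emb) in batch.into_iter().zip(embeddings) {
                // Ignore send errors: the caller may have given up waiting.
                let _ = req.respond.send(emb);
            }
        }
    }

    #[tokio::main]
    async fn main() {
        let (tx, rx) = mpsc::channel(1024);
        tokio::spawn(batcher(rx));

        // Simulate concurrent callers hammering the pipeline.
        let mut handles = Vec::new();
        for i in 0..100 {
            let tx = tx.clone();
            handles.push(tokio::spawn(async move {
                let (respond, rx) = oneshot::channel();
                tx.send(Request { text: format!("doc {i}"), respond }).await.unwrap();
                rx.await.unwrap()
            }));
        }
        for h in handles {
            let emb = h.await.unwrap();
            assert_eq!(emb.len(), 384);
        }
    }

The point of the oneshot-per-request design is that callers stay fully decoupled from batching policy: you can tune MAX_BATCH and FLUSH_AFTER against your GPU without touching any caller code, which is exactly the kind of tuning that is awkward to do safely in Python.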

fm1320

6 hours ago

Have you tried Mojo for this? It seems like it should work as well, though I don't have any experience with it.
