Why Rust won't make your embedding model inference fast

1 point | posted 8 hours ago
by fm1320

3 Comments

storystarling

7 hours ago

Fair point on HBM, but in a production system the orchestration layer is often the actual bottleneck. I've found that keeping the GPU fed requires a level of concurrency and stability that is hard to tune in Python. Rust is useful here not for the inference itself, but for ensuring the request pipeline doesn't choke while the GPU is waiting for data.
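To make that concrete, here is a minimal sketch of the pattern described above, not anyone's production code: requests funnel through an async channel into a batcher that groups them by size or deadline, so the GPU-side worker never starves. It assumes tokio (with the sync, time, macros, and rt features), and MAX_BATCH, FLUSH_AFTER, and run_inference are all illustrative placeholders.

    use std::time::Duration;
    use tokio::sync::{mpsc, oneshot};
    use tokio::time::timeout;

    const MAX_BATCH: usize = 32;
    const FLUSH_AFTER: Duration = Duration::from_millis(5);

    struct Request {
        text: String,
        // Each request carries a oneshot sender so the caller can await its result.
        respond: oneshot::Sender<Vec<f32>>,
    }

    // Stand-in for the actual GPU call (e.g. an FFI call into the inference runtime).
    async fn run_inference(batch: &[Request]) -> Vec<Vec<f32>> {
        batch.iter().map(|_| vec![0.0; 384]).collect()
    }

    async fn batcher(mut rx: mpsc::Receiver<Request>) {
        loop {
            // Block until at least one request arrives.
            let Some(first) = rx.recv().await else { return };
            let mut batch = vec![first];

            // Keep filling the batch until it is full or no new request shows up
            // within FLUSH_AFTER, trading a few ms of latency for utilization.
            while batch.len() < MAX_BATCH {
                match timeout(FLUSH_AFTER, rx.recv()).await {
                    Ok(Some(req)) => batch.push(req),
                    _ => break, // timer fired or channel closed: flush now
                }
            }

            let embeddings = run_inference(&batch).await;
            for (req, emb) in batch.into_iter().zip(embeddings) {
                // Ignore send errors: the caller may have given up waiting.
                let _ = req.respond.send(emb);
            }
        }
    }

    #[tokio::main]
    async fn main() {
        let (tx, rx) = mpsc::channel(1024);
        tokio::spawn(batcher(rx));

        // Simulate concurrent callers hammering the pipeline.
        let mut handles = Vec::new();
        for i in 0..100 {
            let tx = tx.clone();
            handles.push(tokio::spawn(async move {
                let (respond, rx) = oneshot::channel();
                tx.send(Request { text: format!("doc {i}"), respond }).await.unwrap();
                rx.await.unwrap()
            }));
        }
        for h in handles {
            let emb = h.await.unwrap();
            assert_eq!(emb.len(), 384);
        }
    }

The point of the oneshot-per-request design is that callers stay fully decoupled from batching policy: you can tune MAX_BATCH and FLUSH_AFTER against your GPU without touching any caller code, which is exactly the kind of tuning that is awkward to do safely in Python.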

fm1320

6 hours ago

Have you tried Mojo for this? It seems like it should work as well, though I don't have any experience with it.
