pseudosavant
5 hours ago
Not that LLMs are terribly latency sensitive (you wait on a lot of tokens), but what kind of latency impact does this have on requests that go through the proxy?
adilhafeez
5 hours ago
Short answer: the latency impact is very minimal.
We use Envoy as the request handler, which forwards requests to a local service written in Rust. Envoy is proven to be high-performance, low-latency, and highly efficient at request handling. If I had to put a number on it, it would be single-digit milliseconds per request. I will have more detailed benchmarks in the coming days.
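A rough way to sanity-check a claim like "single-digit ms per request" is to time the same request with and without the proxy hop and compare medians. This is just a hedged sketch, not the author's benchmark: `call_direct` and `call_via_proxy` below are hypothetical stand-ins (simulated with `time.sleep`) for real HTTP calls to the backend and to the Envoy listener.

```python
import statistics
import time

def median_latency_ms(send_request, n=50):
    """Time `send_request` n times and return the median latency in ms."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        send_request()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Hypothetical stand-ins; in a real test these would be HTTP calls
# to the upstream directly and through the Envoy proxy.
def call_direct():
    time.sleep(0.002)   # pretend the backend takes ~2 ms

def call_via_proxy():
    time.sleep(0.005)   # pretend backend + ~3 ms proxy overhead

overhead_ms = median_latency_ms(call_via_proxy) - median_latency_ms(call_direct)
print(f"estimated proxy overhead: {overhead_ms:.1f} ms")
```

Using medians rather than means keeps a few slow outliers (GC pauses, cold connections) from skewing the estimate.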
cotran2
5 hours ago
The model is a compact 1.5B, so most GPUs can serve it locally, and it has <100ms e2e latency. On an L40S, it's 50ms.