In your forward pass section you give a lot of emphasis to FlashAttention, but it might be worth mentioning PagedAttention as well (introduced in the paper by the vLLM authors, which I believe was the genesis of the project). PA-style block tables are now supported in most fused attention kernels, but vLLM originally came up with the idea, and it's the main reason vLLM has such high throughput!
Thank you! We have incorporated your suggestion.
Thanks for writing the article!
I didn't quite get this part:
Note that during the prefill phase, all prompt tokens from a request can be processed in one batch. This is possible because the query (Q) tensors, calculated from the tokens immediately before them, are available for each prompt token position.
I know that in practice prefill is much faster than inference. Would watching the 2h video from Karpathy help me understand why?
That snippet is trying to say that you can calculate KV for all the input tokens at once, and you don't need to loop over them since you have them all available.
For decode, by contrast, you have to generate each token sequentially.
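To make the difference concrete, here is a toy sketch, not how vLLM or any real model implements it. It assumes scalar "embeddings", made-up projection weights `W_K` and `W_V`, and an attention output standing in for "the next token"; all names are hypothetical. The point is only the shape of the two phases: prefill fills the KV cache for every prompt position in one pass, while decode has to loop because each step needs the token produced by the previous step.

```python
import math

# Hypothetical projection weights for a toy, scalar-valued "model".
W_K, W_V = 2.0, 3.0


def prefill(prompt):
    """All prompt tokens are known up front, so K and V for every
    position are computed in a single pass (one batch on a real GPU)."""
    keys = [W_K * x for x in prompt]
    values = [W_V * x for x in prompt]
    return keys, values


def attend(query, keys, values):
    """Softmax-weighted average of the cached values for one query."""
    scores = [query * k for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def generate(prompt, steps):
    """Prefill once, then decode one token per loop iteration."""
    keys, values = prefill(prompt)
    # The first output comes from the last prompt position.
    token = attend(prompt[-1], keys, values)
    outputs = [token]
    for _ in range(steps - 1):
        # The new token's K and V are appended to the cache...
        keys.append(W_K * token)
        values.append(W_V * token)
        # ...and the next token depends on it, so steps cannot run in parallel.
        token = attend(token, keys, values)
        outputs.append(token)
    return outputs, len(keys)


outputs, cache_len = generate([0.1, 0.2, 0.3], steps=4)
print(len(outputs), cache_len)
```

The cache ends up holding one entry per prompt token plus one per decode step that fed a token back in.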
And on the topic of prefill: do you know what role GPUs play in prefill vs. in inference?
Prefill is part of Inference. It's the first major step where you calculate all the keys and values for the input tokens.
Decode is the next major step where you start generating output tokens one at a time.
Both run on GPUs but have slightly different workloads:
1. Prefill is compute-heavy, with relatively little memory I/O (HBM to on-chip SRAM) per token.
2. Decode is light on compute but has to read the keys and values computed in the prefill stage (plus those from earlier decode steps) from HBM for every output token.
Doesn't decode also need to stream in the whole of the model weights, thus very I/O heavy?
Yes, decoding is very I/O heavy. It has to stream in the whole of the model weights from HBM for every token decoded. However, that cost can be shared between the requests in the same batch. So if the system has more GPU RAM to hold larger batches, the I/O cost per request can be lowered.
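A back-of-the-envelope sketch of that amortization, with assumed numbers: the function name, the 7B-parameter model, and the 2 bytes per parameter (fp16) are all illustrative, not measurements of any system. It counts only the weight traffic; KV cache reads are per request and are not shared across the batch.

```python
def weight_bytes_per_request(num_params, bytes_per_param, batch_size):
    """HBM bytes of model weights streamed per decode step,
    divided evenly across the requests in the batch."""
    return num_params * bytes_per_param / batch_size


# Hypothetical 7B-parameter model stored in fp16 (2 bytes per parameter).
params = 7_000_000_000
for batch in (1, 8, 32):
    per_request = weight_bytes_per_request(params, 2, batch)
    print(f"batch={batch:>2}: {per_request / 1e9:.4f} GB of weights per request per token")
```

Under these assumptions, going from a batch of 1 to a batch of 32 cuts the weight traffic attributed to each request from 14 GB to about 0.44 GB per generated token.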
Great write-up! It would be interesting to see how many of the features you covered compare with other frameworks.
Thanks for this! Learnt a lot.
Curious how we ensure that the same model instance gets requests from the same client/user, since conversations are stateful and the model needs context from previous turns of the conversation.
Is this happening at the load balancer layer?
They’re not stateful; you submit the entire history with every call. Prompt caching does make it important for performance to have sticky sessions or something similar at the load balancer layer.
Yes, typically users send the newest user message and the full conversation history. These combined become the prompt.
Our API endpoint tries to route requests that have the same prefix to the same vLLM instance (similar to longest prefix matching in networking), in the hope that some of the KV cache for part of the prompt is still there.
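A minimal sketch of what longest-prefix routing could look like. This is not the actual router: the function names, the instance names, and the idea of keeping a list of recently served token sequences per instance are all assumptions made for illustration.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_instance(prompt_tokens, recent_prompts):
    """Return (instance, matched_length) for the instance whose recently
    served prompts share the longest prefix with the new prompt.

    recent_prompts maps a hypothetical instance name to a list of token
    sequences served there. Returns (None, 0) when nothing matches, in
    which case the caller can fall back to any other balancing policy.
    """
    best_name, best_len = None, 0
    for name, prompts in recent_prompts.items():
        for cached in prompts:
            n = shared_prefix_len(prompt_tokens, cached)
            if n > best_len:
                best_name, best_len = name, n
    return best_name, best_len


recent = {
    "vllm-0": [[1, 2, 3, 4]],
    "vllm-1": [[1, 2, 9]],
}
print(pick_instance([1, 2, 3, 5], recent))  # shares 3 tokens with vllm-0
print(pick_instance([7, 8], recent))        # no shared prefix anywhere
```

A linear scan like this is only workable for a handful of prompts; a production router would more plausibly index prefixes with a trie or with hashes of fixed-size token blocks, and it cannot know for sure that the matching KV blocks have not been evicted.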