Show HN: Mini-vLLM in ~500 lines of Python

5 points, posted a month ago
by ubermenchh

4 Comments

zahlman

a month ago

I'm not familiar with the thing you're recreating (I gather it's something to do with getting better responses out of LLMs by manipulating the context, or something like that?), but I appreciate that you haven't, like so many others, dropped ten paragraphs of Markdown-formatted press release on us (without bothering to check whether the formatting even works here), echoing a bunch of marketing-speak in a README.

ubermenchh

a month ago

Haha, I just wanted my repo to be out here. If someone finds it interesting they can always just check the repo. And you're close: it's about getting faster responses from the model by manipulating the request queues and memory.
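
To give a rough idea of the queue side (this is not the repo's actual code; names like Scheduler, Request and max_batch_tokens are made up for illustration), the basic shape of a continuous-batching scheduler is something like:

```python
# Illustrative sketch only -- not the code from the repo.
# Requests wait in a queue and are admitted into the running batch whenever
# the token/memory budget allows, instead of waiting for a whole static
# batch to finish (continuous batching).
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt_tokens: list[int]
    max_new_tokens: int
    generated: list[int] = field(default_factory=list)

    def is_finished(self) -> bool:
        return len(self.generated) >= self.max_new_tokens


class Scheduler:
    def __init__(self, max_batch_tokens: int = 4096):
        self.waiting: deque[Request] = deque()
        self.running: list[Request] = []
        self.max_batch_tokens = max_batch_tokens

    def add(self, req: Request) -> None:
        self.waiting.append(req)

    def _batch_tokens(self) -> int:
        return sum(len(r.prompt_tokens) + len(r.generated) for r in self.running)

    def step(self) -> list[Request]:
        # Admit waiting requests while the token budget allows it.
        while (self.waiting and
               self._batch_tokens() + len(self.waiting[0].prompt_tokens) <= self.max_batch_tokens):
            self.running.append(self.waiting.popleft())
        # A real engine would now run one forward pass over self.running and
        # append one token per request; finished requests are retired right
        # away so their memory can be handed to the next waiting request.
        self.running = [r for r in self.running if not r.is_finished()]
        return self.running
```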

dmarwicke

a month ago

Does this do continuous batching or just static? Couldn't tell from the code.

ubermenchh

a month ago

Yes, it does continuous batching along with paged attention and prefix caching. I am also going to be adding some more inference techniques.
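
For the other two pieces, roughly speaking (again, just a sketch, not the repo's code; BlockManager, block_size, etc. are made-up names): paged attention stores the KV cache in fixed-size blocks referenced through a block table, and prefix caching reuses a block when the entire token prefix up to that block has been seen before.

```python
# Illustrative sketch of a paged KV cache with prefix caching -- not the repo's code.
import hashlib


class BlockManager:
    def __init__(self, block_size: int = 16, num_blocks: int = 1024):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # pool of physical KV blocks
        self.prefix_cache: dict[str, int] = {}       # prefix hash -> physical block id

    def _hash_prefix(self, prefix: tuple[int, ...]) -> str:
        # Key on the whole prefix, so a block is only reused when every
        # preceding token matches as well.
        return hashlib.sha256(repr(prefix).encode()).hexdigest()

    def allocate(self, prompt_tokens: list[int]) -> list[int]:
        """Return a block table (list of physical block ids) for the prompt."""
        table: list[int] = []
        for start in range(0, len(prompt_tokens), self.block_size):
            prefix = tuple(prompt_tokens[: start + self.block_size])
            full_block = len(prefix) % self.block_size == 0
            key = self._hash_prefix(prefix)
            if full_block and key in self.prefix_cache:
                # Prefix hit: reuse the already-computed KV block, no recompute.
                table.append(self.prefix_cache[key])
                continue
            block_id = self.free_blocks.pop()  # would raise if out of memory
            if full_block:
                # Only completely filled blocks are safe to share later.
                self.prefix_cache[key] = block_id
            table.append(block_id)
        return table
```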