BYO – A multi-agent runtime optimized for parallel inference

2 points | posted 4 hours ago
by Yarden_Bruch_El

5 Comments

Shahaf_Wieder

4 hours ago

You're burying the lede: SOTA 'Reasoning Models' (o1/GPT-4) are actually unusable for agent swarms because inference latency kills the recursion loop.

The real alpha here is Parallel Consensus. Running 5 Llama-3 instances via vLLM to critique each other at <200ms TTFT (Time To First Token) beats a single, slow GPT-4 wrapper every time.

Error correction belongs in the orchestration layer, not the model weights. Is the 'One Giant Model' era finally over for agents?
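
Rough sketch of the loop I mean, using vLLM's offline LLM API (the model name, prompts, and the yes/no voting rule here are placeholders, not anyone's actual setup):

    # Parallel consensus sketch: 5 divergent drafts, then cheap critiques,
    # with error correction handled in the orchestration layer.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # illustrative model
    question = "Does this diff introduce a SQL injection risk? <diff here>"

    # Map: sample 5 drafts in a single continuously-batched call.
    draft_params = SamplingParams(n=5, temperature=0.8, max_tokens=256)
    drafts = [c.text for c in llm.generate([question], draft_params)[0].outputs]

    # Critique: each draft gets a yes/no review; all reviews run as one batch.
    review_prompts = [
        f"{question}\n\nCandidate answer:\n{d}\n\n"
        "Is this answer correct? Reply YES or NO:"
        for d in drafts
    ]
    review_params = SamplingParams(temperature=0.0, max_tokens=4)
    votes = [o.outputs[0].text.strip().upper().startswith("YES")
             for o in llm.generate(review_prompts, review_params)]

    # Keep only approved drafts; fall back to the first draft if none pass.
    approved = [d for d, ok in zip(drafts, votes) if ok]
    print((approved or drafts)[0])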

Yarden_Bruch_El

4 hours ago

Spot on. We found that ensembles of small models often beat a single large model.

The catch is VRAM. You can't run parallel swarms efficiently without PagedAttention. We rely on vLLM to share the KV cache for the system prompt—otherwise, spinning up 5 agents for a consensus vote would instantly OOM the GPU.
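
For the curious, a minimal sketch of the shared-prefix setup (model name and prompts are illustrative, not our exact config):

    # Shared system prompt + vLLM automatic prefix caching: the prefix's KV
    # blocks are computed once and reused across the whole consensus batch.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model
        enable_prefix_caching=True,  # reuse KV blocks for identical prefixes
    )

    system_prompt = "You are one of five reviewers debating a code change.\n"
    roles = ["security", "performance", "readability", "testing", "API design"]

    # All five prompts start with the same prefix, so spinning up 5 agents
    # does not pay for 5 copies of the system prompt's KV cache.
    prompts = [f"{system_prompt}Argue from the {role} perspective:\n<diff here>"
               for role in roles]

    outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=300))
    for role, out in zip(roles, outputs):
        print(role, "->", out.outputs[0].text[:60])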

Yarden_Bruch_El

4 hours ago

Hi HN,

We built a platform for orchestrating multi-agent debates (e.g., "Security" vs. "Refactoring" experts).

The Challenge: Standard sequential agent chains (A -> B -> C) are too slow for real-time chat.

The Fix (vLLM): We built a custom inference layer on top of vLLM to solve the bottleneck:

- Parallelism: We use continuous batching to generate multiple agent responses simultaneously rather than waiting for sequential turns.

- Memory: PagedAttention allows our agents to share the KV cache for the common context/system prompts, drastically reducing VRAM usage.

We'd love feedback on the responsiveness. Create an expert, start a debate, and let us know if the parallel inference makes the conversation feel fluid enough.
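
A stripped-down version of the debate loop, to make the parallelism concrete (placeholder model and prompts; this uses vLLM's offline API rather than our production serving layer):

    # One debate round per generate() call: every expert answers in the same
    # batch instead of taking sequential turns, so the engine batches them.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # illustrative model
    params = SamplingParams(temperature=0.7, max_tokens=300)

    topic = "Should we rewrite this module or patch it? <code here>"
    experts = {
        "Security": "You are a security expert. Focus on attack surface.",
        "Refactoring": "You are a refactoring expert. Focus on maintainability.",
    }

    # Round 1: opening arguments, generated simultaneously.
    round1_prompts = [f"{persona}\n\nTopic: {topic}\nYour opening argument:"
                      for persona in experts.values()]
    round1 = [o.outputs[0].text for o in llm.generate(round1_prompts, params)]

    # Round 2: each expert rebuts the other's argument, again in one batch.
    round2_prompts = [
        f"{persona}\n\nTopic: {topic}\nThe other expert argued:\n{other}\n"
        "Your rebuttal:"
        for persona, other in zip(experts.values(), reversed(round1))
    ]
    round2 = [o.outputs[0].text for o in llm.generate(round2_prompts, params)]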

tomer124

3 hours ago

I'm skeptical. vLLM is a throughput engine, not a latency engine. For small batches, TensorRT-LLM smokes it. Also, 'parallel inference' implies a race condition on the context window. If Agent A and B generate simultaneously based on stale state, aren't you just generating 5 divergent hallucinations at once? How do you resolve the merge conflict?

Yarden_Bruch_El

3 hours ago

Thanks for the feedback—that's a solid critique. You're right that TRT-LLM wins on raw latency, but we chose vLLM for the flexibility to hot-swap LoRA adapters dynamically. Regarding the 'race condition': we actually view that divergence as a feature to prevent sycophancy (agents biasing each other). It’s effectively Map-Reduce for conversation.
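
To make the merge step concrete, roughly (placeholder prompts; the real reducer is a tuned prompt and model, not this one-liner):

    # Map-Reduce for conversation: divergent drafts in parallel (map), then a
    # single merge call resolves the "conflict" instead of agents reading each
    # other's half-finished output mid-generation.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # illustrative model
    task = "Review this change and recommend next steps. <diff here>"

    # Map: deliberately divergent takes, sampled in one batched call.
    map_params = SamplingParams(n=5, temperature=0.9, max_tokens=256)
    takes = [c.text for c in llm.generate([task], map_params)[0].outputs]

    # Reduce: merge the five independent reviews into one answer.
    merged = "\n\n---\n\n".join(takes)
    reduce_prompt = (f"{task}\n\nFive independent reviews:\n{merged}\n\n"
                     "Merge these into one consistent recommendation:")
    final = llm.generate([reduce_prompt],
                         SamplingParams(temperature=0.2, max_tokens=400))[0]
    print(final.outputs[0].text)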