Show HN: Spark-LLM-eval – Distributed LLM evaluation for Spark

1 point, posted 16 hours ago
by subhadipmitra

1 comment

subhadipmitra

16 hours ago

Hey HN, I built this because most LLM eval tools assume single-machine execution. When you need to evaluate against millions of examples (customer tickets, documents, etc.), they don't scale without significant duct-taping.

  spark-llm-eval runs natively on Spark - distributed evaluation is the primary design goal, not "Spark as an afterthought."

  Key features:
  - Distributed inference via Pandas UDFs, scales linearly with executors (see the sketch after this list)
  - Statistical rigor by default: bootstrap CIs, paired t-tests, effect sizes
  - Multi-provider: OpenAI, Anthropic, Gemini, vLLM
  - Delta Lake integration for versioned results with lineage
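
  As a sketch of the Pandas UDF pattern (illustrative only, not the library's actual API - `call_model` here is a stand-in for any provider client):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("llm-eval-sketch").getOrCreate()

    def call_model(prompt: str) -> str:
        # Placeholder: swap in a real provider call (OpenAI, Anthropic, etc.)
        return "response for: " + prompt

    @pandas_udf(StringType())
    def generate(prompts: pd.Series) -> pd.Series:
        # Each task receives a batch of prompts; Spark handles the fan-out,
        # so throughput grows with the number of executors.
        return prompts.map(call_model)

    df = spark.createDataFrame([("Summarize ticket #123",)], ["prompt"])
    df.withColumn("completion", generate("prompt")).show(truncate=False)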

  pip install spark-llm-eval

  The main gap I'm filling: "I have 2M labeled examples and need to know if Model A is statistically significantly better than Model B." Most frameworks give you point estimates; this gives you confidence intervals and significance tests.
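
  To make "confidence intervals and significance tests" concrete, here is the underlying statistics in plain numpy/scipy (a sketch of the method, not the library's code - the per-example scores are simulated):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 10_000  # per-example scores: 1.0 = correct, 0.0 = incorrect (simulated)
    scores_a = rng.binomial(1, 0.82, n).astype(float)
    scores_b = rng.binomial(1, 0.80, n).astype(float)

    # Paired bootstrap on the per-example score difference: resample example
    # indices so each model's score for a given example stays paired.
    diffs = scores_a - scores_b
    boot = np.array([diffs[rng.integers(0, n, n)].mean() for _ in range(2000)])
    ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)  # paired t-test
    print(f"mean diff = {diffs.mean():.4f}, 95% CI [{ci_low:.4f}, {ci_high:.4f}]")
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")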

  Blog post with architecture details: https://subhadipmitra.com/blog/2025/building-spark-llm-eval/

  Happy to answer questions about the implementation - rate limiting in distributed contexts was surprisingly tricky.
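
  For anyone curious, the core problem is that there is no shared limiter across executors. One common workaround (my assumption about the general approach, not necessarily what the library does) is to split a global requests-per-second budget across partitions and enforce it locally with a token bucket:

    import time

    class TokenBucket:
        def __init__(self, rate_per_sec: float, capacity: float):
            self.rate = rate_per_sec
            self.capacity = capacity
            self.tokens = capacity
            self.last = time.monotonic()

        def acquire(self):
            # Refill from elapsed time, then block until a token is available.
            while True:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                time.sleep((1 - self.tokens) / self.rate)

    GLOBAL_RPS = 100     # provider-wide limit (assumed)
    NUM_PARTITIONS = 20  # must match the DataFrame's actual partitioning

    def infer_partition(rows):
        # Each Spark task gets an equal slice of the global budget; use with
        # df.rdd.mapPartitions(infer_partition). `call_model` is the stand-in
        # from the earlier sketch.
        bucket = TokenBucket(GLOBAL_RPS / NUM_PARTITIONS, capacity=5)
        for row in rows:
            bucket.acquire()
            yield call_model(row.prompt)

  Keeping NUM_PARTITIONS in sync with the real partitioning, and still backing off on 429s since the local buckets only approximate the global limit, is presumably where it gets tricky.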