Testing LLM Agents Like Software – Behaviour Driven Evals of AI Systems

21 points, posted 3 months ago
by PranoyP

14 Comments

mlop99

3 months ago

Curious if the behaviour driven testing can be done by another LLM agent (or a group of agents) - one LLM agent testing another. Could lead to a self-improving loop?

shailendra145

3 months ago

A powerful move beyond benchmarks — this paper redefines LLM evaluation through realistic, behavior-driven testing.

jlukecarlson

3 months ago

I appreciate the details shared in this paper but it'd be great if they open sourced their implementation!

papz2k

3 months ago

Very interesting work.