Testing LLM Agents Like Software – Behaviour Driven Evals of AI Systems

19 points, posted 8 hours ago
by PranoyP

13 Comments

jlukecarlson

2 hours ago

I appreciate the details shared in this paper, but it'd be great if they open-sourced their implementation!

mlop99

7 hours ago

Curious whether the behaviour-driven testing could itself be done by another LLM agent (or a group of agents), with one LLM agent testing another. Could that lead to a self-improving loop?

user

5 hours ago

[deleted]

shailendra145

7 hours ago

A powerful move beyond benchmarks: this paper redefines LLM evaluation through realistic, behavior-driven testing.

papz2k

7 hours ago

Very interesting work.