GPT-5.2, Grok 4.1, and DeepSeek v3.2 compared as Santa agents

4 points, posted 11 hours ago
by _josh_meyer_

2 Comments

_josh_meyer_

11 hours ago

SantaBench is a fun benchmark with a serious methodology. The task: play a cheeky Santa agent who researches users online and roasts them based on their social media.

_josh_meyer_

10 hours ago

OP here -- I work at Veris and built this. Happy to answer questions about the methodology!