Hackernews
Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard
3 points by Timofeibu, posted 11 hours ago (arxiv.org)