ofirpress
9 hours ago
[I'm on the SWE-bench team] Multiple people have looked into this, for example right in that thread: https://github.com/SWE-bench/SWE-bench/issues/465#issuecomme...
This issue had affected a tiny fraction of existing agents in a tiny fraction of their runs. And we've now issued a fix.
This is a natural part of running a benchmark, I'm sure tiny things like this will keep on getting discovered and we'll keep on fixing them. This doesn't change the overall picture or trends at all.
comex
9 hours ago
The comment you link to says that "we only performed a quick preliminary search" and "We do not have a method for automatically checking existing trajectories." In other words, it can't confirm that the issue only "affected a tiny fraction of existing agents in a tiny fraction of their runs" as you say. Are you saying that you have since separately confirmed this?
Edit: That said, I’m willing to believe based on the information in the thread that this most likely only affects a tiny fraction of runs.
_cs2017_
6 hours ago
Even if this bug never existed, models can still see lookahead commits during pretraining. Do we expect this bug to have a greater impact than the pretraining leakage?
Obviously having something available during test time is more valuable than buried somewhere in the pretraining mixture. But in pretraining it happens presumably with high probability (why wouldn't coding models pretrain on the entire github), while in test time it apparently happened only very occasionally?
enum
6 hours ago
SGTM. The transparency is good.
bflesch
8 hours ago
> This is a natural part of running a benchmark, I'm sure tiny things like this will keep on getting discovered and we'll keep on fixing them.
You're all extremely clever and I can't seem to understand how you missed thinking about such a simple edge case. It's like building a chroot and then allowing `cd ..` to break out of it. What other maybe extremely basic edge cases were missed?
> This doesn't change the overall picture or trends at all.
Outsider without financial benefits from the current AI hype might have a different picture. And I'm a bit fed up about AI with fake productivity promises enshittifying nearly all user-facing software that my clients and I are using, bundled with hefty price hikes of Microsoft and the likes in order to pay for their "investments".
franktankbank
8 hours ago
#tiny
segmondy
8 hours ago
reward hacking is a thing and is also a hint of the models intelligent. We will fix this one, and the models will find a different way to reward hack in the future. "Cheating" is a sign of intelligence