cwyers
3 days ago
The lack of transparency here is wild. They aggregate the scores of the models they test against, which obscures individual performance. They only release results on an internal benchmark that they won't publish. They talk about RL training but don't discuss anything else about how the model was trained, including whether they did their own pre-training or fine-tuned an existing model. I'm skeptical of basically everything claimed here until they either share more details or someone is able to independently benchmark this.
criemen
3 days ago
I understand where you're coming from, and I'd love to have learned about pre-training vs. off-the-shelf base model too. But
> their own internal benchmark that they won't release
If they'd release their internal benchmark suite, it'd make it into the training set of just about every LLM, which, from a strictly scientific standpoint, invalidates all conclusions drawn from that benchmark from then on. On the other hand, not releasing the benchmark means they could've hand-picked the data points to favor themselves. It's a problem that can't be resolved, unfortunately.
cwyers
3 days ago
I'm not saying SWE-bench is perfect, and there are reports suggesting some contamination of LLM training sets with common benchmarks like SWE-bench. But the SWE-bench maintainers publish the benchmark so anyone can run it, and they keep an open leaderboard that attributes results to specific models, not just vague groupings.
ARC-AGI-2 keeps a private set of questions to prevent LLM contamination, but it has a public set of training and eval questions, so people can both evaluate their models before submitting to ARC-AGI and see what the benchmark is measuring:
https://github.com/arcprize/ARC-AGI-2
Cursor is not alone in having to deal with benchmark contamination. But it is an outlier in sharing so little when proposing a new benchmark while also not showing performance on the industry-standard ones. Without a bigger effort to show what the benchmark is and how other models perform on it, I think its utility is limited at best.
nickpsecurity
3 days ago
In high-security systems, we solved this problem with trusted, independent evaluators who got all the data. They replicate the results themselves. They analyze every artifact for flaws. They also pen test the system offensively. If they say it's good, then maybe it is good, or maybe a little less than that, but not obviously bad.
We could have third-party groups with their own evaluation criteria who don't make models or sell A.I.: strictly evaluators. Alternatively, they could have a different, steady source of income, with evaluation being the only A.I. work they do.
infecto
2 days ago
Disagree. The ultimate bar, which is easily measurable, is whether users find value in it. Benchmarks are mostly meaningless, especially in the area where, in my opinion, Cursor shines: the tool chain. You can go try Composer yourself today and see if it's valuable to you.
diggan
2 days ago
Isn't that up to the reader/visitor/user to decide? As it stands right now, Cursor is publishing results without saying how they were obtained, comparing them against aggregate scores whose underlying results we don't know, and you're saying "it doesn't matter, the tool is better anyways".
Then why publish the obscured benchmarks in the first place?
infecto
2 days ago
No, I said I don't believe any of the existing benchmarks do well when it comes to using a tool chain. They built a model specifically to be used with their tool-chain calls, something a lot of the models out there struggle with.
NitpickLawyer
3 days ago
Does it really matter, though? At the end of the day, what matters most is whether real users find it useful or not. And Cursor has that data (both historically and in real time). Thousands of accepts/rejects >>> any benchmark you can come up with. That should allow them to iterate on it and make it better, eventually.
Benchmarks have become less and less useful. We have our own tests that we run whenever a new model comes out: a collection of trivial -> medium -> hard tasks that we've gathered, and it's much more useful to us than any published table. It also leads to more interesting finds, such as using cheaper models (5-mini, fast-code-1, etc.) on some tasks vs. the big guns on others.
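To give an idea of the shape of such a suite, here's a rough, purely hypothetical Python sketch (made-up tasks and model names, stubbed run_model call, standard library only); not our actual harness:

  # Hypothetical tiered internal eval harness; everything here is illustrative.
  from dataclasses import dataclass
  from typing import Callable

  @dataclass
  class Task:
      name: str
      tier: str                      # "trivial", "medium", or "hard"
      prompt: str
      check: Callable[[str], bool]   # crude pass/fail check on the model output

  TASKS = [
      Task("fizzbuzz", "trivial", "Write FizzBuzz in Python.",
           lambda out: "FizzBuzz" in out),
      Task("dedupe", "medium", "Refactor this function to remove duplication: ...",
           lambda out: "def " in out),
      Task("race-fix", "hard", "Find and fix the race condition in this code: ...",
           lambda out: "lock" in out.lower()),
  ]

  def run_model(model_name: str, prompt: str) -> str:
      # Stub: replace with a real API call to whatever provider you use.
      # Returning "" means every check fails until you wire it up.
      return ""

  def evaluate(model_name: str, tasks: list[Task]) -> dict[str, float]:
      """Pass rate per difficulty tier for one model."""
      per_tier: dict[str, list[bool]] = {}
      for task in tasks:
          output = run_model(model_name, task.prompt)
          per_tier.setdefault(task.tier, []).append(task.check(output))
      return {tier: sum(oks) / len(oks) for tier, oks in per_tier.items()}

  if __name__ == "__main__":
      # Compare a cheap model against a big one; route each tier to whichever
      # cheap model clears it reliably. Model names are hypothetical.
      for model in ("cheap-model", "big-gun-model"):
          print(model, evaluate(model, TASKS))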
I'm happy to see Cursor iterate, as they were pretty vulnerable to the labs leaving them behind when all of them came out with coding agents. The multi-agent mode w/ built-in git worktree support is another big thing they launched recently. They can use their users as "teachers" for completions from multiple competing models, and by proxying those calls they get all the signals, which they can then use to iterate on their own models. Cool stuff. We actually need competing products keeping each other in check, w/ the end result being more options for us, and sometimes even cheaper usage overall.