mrandish
7 days ago
I'm not even following AI model performance testing that closely, but I'm hearing increasing reports that benchmark scores are inaccurate due to accidental or intentional test data leaking into training data, and other ways of training to the test.
Also, ARC AGI reported they've been unable to independently replicate OpenAI's claimed breakthrough score from December. There's just too much money at stake now to not treat all AI model performance testing as an adversarial, no-holds-barred brawl. The default assumption should be all entrants will cheat in any way possible. Commercial entrants with large teams of highly-incentivized people will search and optimize for every possible advantage - if not outright cheat. As a result, smaller academic, student or community teams working part-time will tend to score lower than they would on a level playing field.
malisper
6 days ago
> Also, ARC AGI reported they've been unable to independently replicate OpenAI's claimed breakthrough score from December
Can you elaborate on this? Where did ARC AGI report that? From ARC AGI[0]:
> ARC Prize Foundation was invited by OpenAI to join their “12 Days Of OpenAI.” Here, we shared the results of their first o3 model, o3-preview, on ARC-AGI. It set a new high-water mark for test-time compute, applying near-max resources to the ARC-AGI benchmark.
> We announced that o3-preview (low compute) scored 76% on ARC-AGI-1 Semi Private Eval set and was eligible for our public leaderboard. When we lifted the compute limits, o3-preview (high compute) scored 88%. This was a clear demonstration of what the model could do with unrestricted test-time resources. Both scores were verified to be state of the art.
That makes it sound like ARC AGI were the ones running the original test with o3.
What they say they haven't been able to reproduce is o3-preview's performance with the production versions of o3. They attribute this to the production versions being given less compute than the versions they ran in the test.
godelski
7 days ago
> inaccurate due to accidental or intentional test data leaking into training data and other ways of training to the test.
Even if you assume no intentional data leakage, it is fairly easy to do it accidentally. Defining good test data is hard. Your test data should be disjoint from your training data, and even exact deduplication is hard. But your test data should also belong to the same target distribution while being sufficiently distant from your training data, so that you actually measure generalization. This is ill-defined in the best of cases: ideally you want to maximize the distance between training data and test data, but in high dimensional settings distance is essentially meaningless (you cannot distinguish the nearest point from the furthest).
Plus, there are standard procedures that are explicit data leakage. Commonly, people will update hyperparameters to increase test results. While the model doesn't have access to the test data, you are passing along information: you are the data (information) leakage. Meta-information is still useful to machine learning models and they will exploit it. That's why there are things like optimal hyperparameters and initialization schemes that lead to better solutions (or mode collapse), and it's even part of the lottery ticket hypothesis.
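A minimal sketch of the "you are the leakage" point, using pure NumPy and entirely hypothetical data (no real model or benchmark): the labels below are pure noise, so there is nothing to learn, yet selecting a "hyperparameter" (here just a random weight vector) by repeatedly checking accuracy on the test set yields a score well above chance on that test set, while an untouched holdout collapses back to chance.

```python
# Hypothetical toy setup: random features, random labels (no signal at all).
# Any above-chance "test" score can only come from the selection loop itself.
import numpy as np

rng = np.random.default_rng(0)
n, d, trials = 200, 50, 500

X_test = rng.normal(size=(n, d))
y_test = rng.integers(0, 2, size=n)      # pure noise labels
X_fresh = rng.normal(size=(n, d))
y_fresh = rng.integers(0, 2, size=n)     # untouched holdout, also noise

def accuracy(w, X, y):
    """Accuracy of a linear threshold 'model' with weights w."""
    return np.mean((X @ w > 0).astype(int) == y)

# "Hyperparameter search" where the human picks whatever scores best on test.
best_w, best_acc = None, -1.0
for _ in range(trials):
    w = rng.normal(size=d)               # a candidate "configuration"
    acc = accuracy(w, X_test, y_test)
    if acc > best_acc:
        best_w, best_acc = w, acc

print(f"accuracy on the test set we tuned against: {best_acc:.2f}")   # typically ~0.6, well above chance
print(f"accuracy on a fresh holdout: {accuracy(best_w, X_fresh, y_fresh):.2f}")  # ~0.5, i.e. chance
```

The standard fix is, of course, to tune against a separate validation split and touch the test set once at the end; the sketch is only meant to show the mechanism by which a human in the loop leaks test information.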
Measuring is pretty messy stuff, even in the best of situations. Intentional data leakage removes all sense of good faith. Unintentional data leakage stresses the importance of domain depth, and is one of the key reasons learning math is so helpful in machine learning: even the intuition can provide critical insights. Ignoring this fact of life is myopic.
> smaller academic, student or community teams working part-time will tend to score lower than they would on a level playing field.
It is rare for academics and students to work "part-time". I'm about to defend my PhD (in ML) and I rarely take vacations and rarely work less than 50 hrs/wk. This is also pretty common among my peers.
But a big problem is that the "GPU Poor" notion is ill-founded. It ignores a critical aspect of the research and development cycle: basic research. You can see this in something like NASA's TRL scale[0]. Classically, academics work predominantly at the low TRLs, but there's been a weird push in ML (and not too uncommon in CS in general) to focus on products rather than on expanding knowledge/foundations. While TRL 1-4 work has extremely high failure rates (even between steps), it lays the foundation that lets us build higher-TRL things (i.e. products). The notion that you can't do small scale (data or compute) experiments and contribute to the field is damaging. It sets us back. It breeds stagnation because it necessitates narrowing of research directions; you can't be as risky. The consequence can only be a Wile E. Coyote moment, where we're running and suddenly find there is no ground beneath us. We had a good thing going: government money funds low-level research, which carries higher risk and longer time horizons for returns, but that research becomes public and thus provides foundations for others to build on top of.
[0] https://www.nasa.gov/directorates/somd/space-communications-...
mrandish
7 days ago
> It is rare for academics and students to work "part-time".
Sorry, that phrasing didn't properly convey my intent, which was more that most academics, students and community/hobbyists have other simultaneous responsibilities which they must balance.
godelski
6 days ago
Thanks for the clarification. That makes more sense, but I need to push back a tad: it's a bit messier for academia (I don't disagree about community/hobbyists).
In the US PhD system, students usually take classes during the first two years, and this is often when they serve as teaching assistants too. But after quals (or whatever), when you advance to PhD Candidate, you no longer take classes, and your funding frequently comes through grants or other sources (though it may still include teaching/assisting; funding is always in flux...). For most of my time, as is common for most PhDs in my department, I've been funded to do research. While still classified as 0.49 employee and 0.51 student, the work is identical despite the categorization.
My point is that I would not generalize this notion. There's certainly very high variance, but I think it is more wrong than right. Sure, I do have other responsibilities like publishing, mentoring, and random bureaucratic/administrative stuff, but that isn't exceptionally different from when I've interned or the 4 years I spent working prior to going to grad school.
Though I think something that is wild about this system (and that generalizes outside academia) is that it completely flips when you graduate from PhD {Student,Candidate} to Professor. As a professor you have so many auxiliary responsibilities that most do not have time for research: you have to teach, write grants, do a lot of department service (admins seem to increase this workload, not decrease it...), and more. It seems odd to train someone for many years and then put them in an essentially administrative or managerial role.
I say this generalizes because we do the same thing outside academia: you can usually only get promoted as an engineer (pick your term) for so long before you need to transition into management. I definitely want technical managers, but that shouldn't preclude a path for advancement through technical capability. You spent all that time training and honing those skills, so why abandon them? Why assume they transfer to the skills of management? (Some do, but enough?) This is quite baffling to me and I don't know why we do it. In "academia" you can kind of avoid this by going to a post-doc, a government lab, or even the private sector. But post-docs and the private sector just delay the transition, and government labs are hit or miss (though this is why people like working there and will often sacrifice salary to do so).
(The idea in academia is that you have full freedom once you're tenured. But it isn't like the pressures of "publish or perish" disappear, and it is hard to break habits. Plus, you'd be a real dick if you sacrificed your PhD students' careers in pursuit of your own work. So the idealized belief is quite inaccurate. If anything, we want young researchers to be the ones attempting riskier research.)
TL;DR: for graduate students, I disagree; but for professors/hobbyists/undergrads/etc., I do agree.