woeirua
22 days ago
It's just amazing to me how fast the goal posts are moving. Four years ago, if you had told someone that an LLM would be able to one-shot either of those first two tasks, they would've said you're crazy. The tech is moving so fast. I slept on Opus 4.5 because GPT 5 was kind of an air ball, and only started using it in the past few weeks. It's so good. Way better than almost anything that's come before it. It can one-shot tasks we never would've considered possible before.
skue
22 days ago
> Four years ago, if you had told someone that an LLM would be able to one-shot either of those first two tasks, they would've said you're crazy.
Four years ago, they would more likely have asked what in the world an LLM even is. ChatGPT is barely three years old.
enraged_camel
22 days ago
It literally saved my small startup six figures and months of work. I've written about it extensively and posted about it (it's in my submissions).
ranyume
22 days ago
There are certain LLM phenomena that haven't changed since their introduction.
Madmallard
22 days ago
Idk, I was using ChatGPT 3.5 to do stuff and it was pretty helpful even then.
utopiah
22 days ago
> The tech is moving so fast.
Well, that's exactly the problem: how can one say that?
The entire process of evaluating what "it" actually does has been a problem from the start. Input text, output text... OK, but what if the training data includes the evaluation? This was a ridiculous concern a few years ago, but then the scale went from some curated text datasets to most of the Web as text, to most of the Web as text including transcriptions from videos, to most of the Web plus some non-public databases, to all of that PLUS (and that's just cheating) tests that were specifically designed NOT to be present anywhere else.
So again, that's the crux of the problem: WHAT does it actually do? Is it "just" search? Is it semantic search with search-and-replace? Is it that plus some evaluation it runs?
Sure, the scaffolding gets bigger, the available dataset gets larger, and the available compute keeps increasing, but that STILL does not answer the fundamental question, namely what is actually being done. The assumption here is that because the output text does solve the question asked, "it" works, it "solved" the problem. The problem is that, by definition, the entire setup has been built to look as plausible as possible. So it's not luck that it initially appears realistic. It's not luck that it can thus pass some dedicated benchmark, but that is also NOT the same as solving the problem.
So yes, sure, the "tech" is moving "so fast", but we still can't agree on what it does, we still have no good benchmarks, and we still have that jagged frontier https://www.hbs.edu/faculty/Pages/item.aspx?num=64700 that makes it so challenging to make any statement more meaningful than "moving so fast", which sounds like a marketing claim.
computerex
22 days ago
You know LLMs have been used to solve very hard, previously unsolved math problems, like some of the Erdős problems?
patagurbon
22 days ago
That Erdős problem solution is believed by quite a few people to be a previous result from the literature, just used in a slightly different way. It also seems it wasn't held up by a lack of progress; simply no one had cared to give it a go.
That’s a really fantastic capability, but not super surprising.
bakkoting
21 days ago
You're thinking of a previous report from a month ago, #897 or #481, or the one from two weeks ago, #728. There's a new one from a week ago, #205, which is genuinely novel, although it is still a relatively "shallow" result.
Terence Tao maintains a list [1] of AI attempts (successful and otherwise). #205 is currently the only success in section 1, the "full solution for which subsequent literature review did not find new relevant prior partial or full solutions" section - but it is in that section.
As to speed, as far as I know the recent results are all due to GPT-5.2, which is barely a month old, or Aristotle, a system built on top of some frontier LLMs that has only been publicly accessible for a month or two. I have seen multiple mathematicians report that GPT-5.2 is a major improvement in proof-writing, e.g. [2]
[1] https://github.com/teorth/erdosproblems/wiki/AI-contribution...
utopiah
21 days ago
Thanks for the wiki link, very interesting, in particular:
- the long-tail aspect of the problem space: 'a "long tail" of under-explored problems at the other, many of which are "low hanging fruit" that are very suitable for being attacked by current AI tools'
- the expertise requirement: not just literature review but also 'Do I understand what the key ideas of the solution are, and how the hypotheses are utilized to reach the conclusion?', so basically one must already be an expert (or able to become one) to actually use this kind of tooling
and finally the outcomes, which, taking the previous two points into consideration, are very different from what most people would assume "AI contributions" to be.
utopiah
21 days ago
I do, and I've read Tao's comments on his usage too, but that still doesn't address what I wrote.
computerex
21 days ago
How does it not address what you wrote?
utopiah
21 days ago
If I understood correctly, you are giving an example of a "success" of the technology. That addresses whether the technology is useful or powerful, but it does not address what it actually does (maybe somewhere inside ChatGPT there's a gnome that solved it; I'm just being provocative here to make the point), or, more importantly, whether it's doing something it couldn't do a year or five years ago, i.e. how it is doing something new.
For example, if somebody had used GPT-2 with the input dataset of GPT-5.2 (assuming that's the model used for the Erdős problems) rather than the dataset it had at the time, could it have solved those same problems? Without such tests it's hard to say whether it moved fast, or at all. Just because something new has been solved by it doesn't mean the thing itself is new. Yes, that's a reasonable assumption, but it's just that. So going from there to assuming "it" is "moving fast" is just a belief, IMHO.
utopiah
21 days ago
Also, something that makes the whole process very hard to verify is what I tried to address in a much older comment: whenever LLMs are used (regardless of the input dataset) by someone who is an expert in the domain (rather than a novice), how can one evaluate what's been done by whom, or by what? Sure, again, there can be a positive result, e.g. a solution to a problem unsolved until now, but what does that say about the tool itself versus a user who is, by definition if they are an expert, up to date on the state of the art?
utopiah
21 days ago
Also, the very fact that https://github.com/teorth/erdosproblems/wiki/AI-contribution... exists totally changes the landscape. Because it's public, it's safe to assume it's now part of the input dataset, so from here on, how does one evaluate the pace of progress, in particular for non-open-source models?