0x20cowboy
7 hours ago
> LLMs are PhD-level reasoners in math and science, yet they fail at children's puzzles. How is this possible?
Because they are not.
Pattern matching questions on a contrived test is not the same thing as understanding or reasoning.
It’s the same reason most of the people who pass your leetcode tests don’t actually know how to build anything real. They are taught to the test, not to reality.
gwd
5 hours ago
> Pattern matching questions on a contrived test is not the same thing as understanding or reasoning.
Do submarines swim? I don't really care if it gets me where I want to go. The fact is that just two days ago, I asked Claude to look at some reasonably complicated concurrent code to which I had added a new feature, and asked it to list what tests needed to be added; and then when I asked GPT-5 to add them, it one-shot nailed the implementations. I've written a gist of it here:
https://gitlab.com/-/snippets/4889253
Seriously, just read the description of the test it's trying to write.
In order to one-shot that code, it had to understand:
- How the cache was supposed to work
- How conceptually to set up the scenario described
- How to assemble golang's concurrency primitives (channels, goroutines, and waitgroups), in the correct order, to achieve the goal.
Did it have a library of concurrency testing patterns in its head? Probably -- so do I. Had it ever seen my exact package before in its training? Never.
I just don't see how you can argue with a straight face that this is "pattern matching". If that's pattern matching, then pattern matching is not an insult.
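(For the curious: the test has roughly the shape of the sketch below. All names here are hypothetical stand-ins rather than the actual code from the gist - the real cache and its API are more involved - but it shows the kind of channel/goroutine/WaitGroup assembly I'm talking about.)

    // Hypothetical, self-contained test file (cache_test.go) sketching the
    // general shape of such a concurrency test; not the code from the gist.
    package cache

    import (
        "sync"
        "testing"
    )

    // Cache is a stand-in for the real concurrent cache under test.
    type Cache struct {
        mu    sync.Mutex
        data  map[string]int
        loads int // how many times the expensive load function ran
    }

    // Get returns the cached value for key, computing and storing it on a miss.
    func (c *Cache) Get(key string, load func() int) int {
        c.mu.Lock()
        defer c.mu.Unlock()
        if v, ok := c.data[key]; ok {
            return v
        }
        c.loads++
        v := load()
        c.data[key] = v
        return v
    }

    // TestConcurrentGetLoadsOnce releases many goroutines against the same key
    // at once and checks that the load function only ran a single time.
    func TestConcurrentGetLoadsOnce(t *testing.T) {
        c := &Cache{data: map[string]int{}}

        start := make(chan struct{}) // gate so all goroutines hit the cache together
        var wg sync.WaitGroup

        for i := 0; i < 10; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                <-start // wait for the starting gun
                c.Get("key", func() int { return 42 })
            }()
        }

        close(start) // fire the starting gun
        wg.Wait()    // wait for every Get to return

        if c.loads != 1 {
            t.Fatalf("expected load to run once, ran %d times", c.loads)
        }
    }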
If anything, the examples in this article are the opposite. Take the second example, which is basically 'assemble these assorted pieces into a rectangle'. Nearly every adult has assembled at least dozens of things in their lives; many have assembled thousands. So it's humans in this case who are simply "pattern matching questions on a contrived test", and the LLMs, which almost certainly didn't have a lot of "assemble these items" in their training data, that are reasoning out what's going on from first principles.
HarHarVeryFunny
2 hours ago
> Do submarines swim?
It doesn't matter HOW LLMs "swim" as long as they can, but the point being raised is whether they actually can.
It's as if LLMs can swim in the ocean, in rough surf, but fail to swim in rivers or swimming pools, because they don't have a generalized ability to swim - they've just been RL-trained on the solution steps to swimming in surf, but since those exact conditions don't exist in a river (which might seem like a less challenging environment), they fail there.
So, the question that might be asked is: when LLMs are trained to perform well in these vertical domains like math and programming, where it's easy to verify results and provide outcome- or process-based RL rewards, are they really learning to reason, or are they just learning to pattern match to steer generation in the direction of problem-specific reasoning steps that they had been trained on?
Does the LLM have the capability to reason/swim, or is it really just an expert system that has been given the rules to reason/swim in certain cases, but would need to be similarly hand fed the reasoning steps to be successful in other cases?
I think the answer is pretty obvious given that LLMs can't learn at runtime - can't try out some reasoning generalization they may have arrived at, find that it doesn't work in a specific case, then explore the problem and figure it out for next time.
Given that it's Demis Hassabis who is pointing out this deficiency of LLMs (and has a 5-10 year plan/timeline to fix it - AGI), not some ill-informed LLM critic, it seems silly to deny it.
gwd
an hour ago
> I think the answer is pretty obvious given that LLMs can't learn at runtime - can't try out some reasoning generalization they may have arrived at, find that it doesn't work in a specific case, then explore the problem and figure it out for next time.
This is just a problem of memory. Supposing that an LLM did generate a genuinely novel insight, it could in theory write a note for itself, so that the next time it comes online it can read through a summary of the things it learned. It could also write synthetic training data for itself, so that the next time it's trained, that insight gets incorporated into its general knowledge.
OpenAI allows you to fine-tune GPT models, I believe. You could imagine a GPT system working for 8 hours in a day, then spending a bunch of time looking over all its conversations for patterns, insights, or things to learn, and then modifying its own fine-tuning data (adding, removing, or changing entries as appropriate), which it then uses to train itself overnight, waking up the next morning having synthesized the previous day's experience.
HarHarVeryFunny
a minute ago
> This is just a problem of memory
How does memory (maybe later incorporated via fine-tuning) help if you can't figure out how to do something in the first place?
That would be a way to incorporate new declarative data at "runtime" - feedback to the AI intern as to what it is doing wrong. However, doing something effectively by yourself generally requires more than just new knowledge - it requires personal practice/experimentation etc, since you need to learn how to act based on the contents of your own mind, not that of the instructor.
Even when you've had enough practice to become proficient at a taught skill, you may not be able to verbalize exactly what you are doing (which is part of the teacher-student gap), so attempting to describe and then capture that as textual/context "sensory input" is not always going to work.
naasking
an hour ago
> are they really learning to reason, or are they just learning to pattern match to steer generation in the direction of problem-specific reasoning steps that they had been trained on?
Are you sure there's a real difference? Do you have a definition of "reasoning" that excludes this?
mjr00
41 minutes ago
It's trivial to demonstrate that LLMs are pattern matching rather than reasoning. A good way is to provide modified riddles-that-aren't. As an example:
> Prompt: A man working at some white collar job gets an interview scheduled with an MBA candidate. The man says "I can't interview this candidate, he's my son." How is this possible?
> ChatGPT: Because the interviewer is the candidate’s mother. (The riddle plays on the assumption that the interviewer must be a man.)
This is clearly pattern matching and overfitting to the "doctor riddle" and a good demonstration of how there's no actual reasoning going on. A human would read the prompt and initially be confused; LLMs show no such confusion, because they don't actually reason.
naasking
5 minutes ago
> It's trivial to demonstrate that LLMs are pattern matching rather than reasoning.
Again, this is just asserting the assumption that reasoning cannot include pattern matching, but this has never been justified. What is your definition for "reasoning"?
> This is clearly pattern matching and overfitting to the "doctor riddle" and a good demonstration of how there's no actual reasoning going on.
Not really, no. "Bad reasoning" does not entail "no reasoning". Your conclusion is simply too strong for the evidence available, which is why I'm asking for a rigorous definition of reasoning that doesn't leave room for disagreement about whether pattern matching counts.
HarHarVeryFunny
an hour ago
I define intelligence as prediction (degree of ability to use past experience to correctly predict future action outcomes), and reasoning/planning as multi-step what-if prediction.
Certainly if a human (or some AI) has learned to predict/reason over some domain, then what they will be doing is pattern matching to determine the generalizations and exceptions that apply in a given context (including a hypothetical context in a what-if reasoning chain), in order to be able to select a next step that worked before.
However, I think what we're really talking about here isn't the mechanics of applying learnt reasoning (context pattern matching), but rather the ability to reason in the general case, which requires the ability to LEARN to solve novel problems, which is what is missing from LLMs.
A system that has a fixed set of (reasoning/prediction) rules, but can't learn new ones for itself, seems better regarded as an expert system. We need to make the distinction between a system that can only apply rules, and one that can actually figure out the rules in the first place.
In terms of my definitions of intelligence and reasoning, based around the ability to use past experience to learn to predict, any system that can't learn from fresh experience doesn't meet that definition.
Of course in humans and other intelligent animals the distinction between past and ongoing experience doesn't apply since they can learn continually and incrementally (something that is lacking from LLMs), so for AI we need to use a different vocabulary, and "expert system" seems the obvious label for something that can use rules, but not discover them for itself.
gwd
an hour ago
So I do think there are two distinct types of activities involved in knowledge work:
1. Taking established techniques or concepts and appropriately applying them to novel situations.
2. Inventing or synthesizing new, never-before-seen techniques or concepts
The vast majority of the time, humans do #1. LLMs certainly do this in some contexts as well, as demonstrated by my example above. This to me counts as "understanding" and "thinking". Some people define "understanding" such that it's something only humans can do; to which I respond, I don't care what you call it, it's useful.
Can LLMs do #2? I don't know. They've got such extensive experience that how would you know whether they'd invented a technique or just seen it somewhere?
But I'd venture to argue that most humans never or rarely do #2.
HarHarVeryFunny
26 minutes ago
> But I'd venture to argue that most humans never or rarely do #2.
That seems fair, although the distinction between synthesizing something new and combining existing techniques is a bit blurry.
What's missing from LLMs though is really part of #1. If techniques A, B, C & D are all the tools you need to solve a novel problem, then a human has the capability of learning WHEN to use each of these tools, and in what order/combination, to solve that problem - a process of trial and error, generalization and exception, etc. It's not just the techniques (bag of tools) you need, but also the rules (acquired knowledge) of how they can be used to solve different problems.
LLMs aren't able to learn at runtime from their own experience, so the only way they can learn these rules of when to apply given tools (aka reasoning steps) is by RL training on how they have been successfully used to solve a range of problems in the training data. So the LLM may have learnt that in a specific context it should first apply tool A (generate that reasoning step), etc, but that doesn't help it solve a novel problem where the same sequence of solution steps doesn't apply, even if tools A-D are all it needs (if only it could learn how to apply them to this novel problem).
freejazz
an hour ago
It seems readily apparent there is a difference given their inability to do tasks we would otherwise reasonably describe as achievable via basic reasoning on the same facts.
OtherShrezzing
an hour ago
>and the LLMs, which almost certainly didn't have a lot of "assemble these items" in their training data
I don't think this assumption is sound. Humans write a huge amount on "assemble components x and y to make entity z". I'd expect all LLMs to have consumed every IKEA-type instruction manual, the rules for Jenga, and all geometry textbooks and papers ever written.
vlovich123
an hour ago
I could be mistaken, but generally LLMs cannot tackle out-of-domain problems, whereas humans do seem to have that capability. Relatedly, the energy costs are wildly different, suggesting that LLMs are imitating some kind of thought rather than simulating it. They’re doing a remarkable job of passing the Turing test, but that says more about the limitations of the Turing test than it does about the capabilities of the LLMs.
amelius
2 hours ago
Most of our coding is just plumbing: getting data from one place to where it needs to be. There is no advanced reasoning necessary, just a good idea of the structure of the code and the data structures.
Even high school maths tests are way harder than what most professional programmers do on a daily basis.
Akronymus
4 hours ago
> I just don't see how you can argue with a straight face that this is "pattern matching". If that's pattern matching, then pattern matching is not an insult.
IMO it's still "just" a very good autocomplete. No actual reasoning, just lots of statistics on which token to spit out next.
NoahZuniga
4 hours ago
> Do submarines swim?
That's the main point of the parent comment. Arguing about the definition of "reasoning" or "pattern matching" is just a waste of time. What really matters is whether it produces helpful output. Arguing about that is way better!
Instead of saying: "It's just pattern matching -> It won't improve the world", make an argument like: "AIs seem to have trouble specializing like humans -> adopting AI will increase error rates in business processes -> due to the amount of possible edge cases, most people will get into an edge case with no hope of escaping it -> many people's lives will get worse".
The first example relies on us agreeing on the definition of pattern matching, and then taking a conclusion based on how those words feel. This has no hope of convincing me if I don't like your definition! The second one is an argument that could potentially convince me, even if I'm an AI optimist. It is also just by itself an interesting line of reasoning.
ozgung
3 hours ago
No, it's not "just a very good autocomplete". I don't know why people repeat this claim (it's wrong), but I find it an extremely counterproductive position. Some people just love to dismiss the capabilities of AI with a very shallow understanding of how it works. Why?
It generates words one by one, like we all do. That doesn't mean it does just that and nothing else. That's the mechanics of how these models are trained, how they do inference, and, most importantly, how they communicate with us. It doesn't define what they are or what their limits are. This is reductionism - ignoring the mathematical complexity of a giant neural network.
Bjartr
2 hours ago
> like we all do
Do we though? Sure, we communicate sequentially, but that doesn't mean that our internal effort is piecewise and linear. A modern transformer LLM, however, is. Each token is sampled from a distribution that depends exclusively on the tokens that came before it.
Mechanistically speaking, it works similarly to autocomplete, but at a very different scale.
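(To make the mechanistic point concrete, here's a toy sketch of that loop in Go - the "model" is a made-up uniform placeholder, not any real inference API:)

    // Toy autoregressive decoding loop: each step sees only the tokens
    // generated so far and samples the next one from a distribution.
    package main

    import (
        "fmt"
        "math/rand"
    )

    // nextTokenProbs stands in for a forward pass through a model. Its only
    // input is the prefix generated so far; here it just returns a uniform
    // distribution over the vocabulary.
    func nextTokenProbs(prefix []int, vocabSize int) []float64 {
        probs := make([]float64, vocabSize)
        for i := range probs {
            probs[i] = 1.0 / float64(vocabSize)
        }
        return probs
    }

    // sample draws one token index from a probability distribution.
    func sample(probs []float64) int {
        r, cum := rand.Float64(), 0.0
        for i, p := range probs {
            cum += p
            if r < cum {
                return i
            }
        }
        return len(probs) - 1
    }

    func main() {
        const vocabSize = 100
        tokens := []int{1, 2, 3} // the prompt
        for step := 0; step < 10; step++ {
            probs := nextTokenProbs(tokens, vocabSize) // depends only on tokens so far
            tokens = append(tokens, sample(probs))     // no lookahead, no revision
        }
        fmt.Println(tokens)
    }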
Now how much of an unavoidable handicap this incurs, if any, is absolutely up for debate.
But yes, taking this mechanistic truth and only considering it in a shallow manner underestimates the capability of LLMs by a large degree.
kenjackson
2 hours ago
Our thinking is also based only on events that occurred previously in time. We don’t use events in the future.
ElevenLathe
2 hours ago
Is this a certainty? I thought it was an open question whether quantum effects are at play in the brain, and those have a counterintuitive relationship with time (to vastly dumb things down in a way my grug mind can comprehend).
kenjackson
2 hours ago
Well there’s no evidence of this that I’ve seen. If so, then maybe that’s the blocker for AGI.
karmakaze
3 hours ago
I can't say for certain that our wetware isn't "just a very good autocomplete".
esafak
an hour ago
A very good autocomplete is realized by developing an understanding.
faangguyindia
an hour ago
>Pattern matching questions on a contrived test is not the same thing as understanding or reasoning.
I think most of the problems I solve are also pattern matching. The problems I am good at solving are the ones I've seen before, or the ones I can break into problems I've seen before.
ACCount37
5 hours ago
"Not understanding or reasoning" is anthropocentric cope. There is very little practical difference between "understanding" and "reasoning" implemented in human mind and that implemented in LLMs.
One notable difference, however, is that LLMs disproportionately suck at spatial reasoning. Which shouldn't be surprising, considering that their training datasets are almost entirely text. The ultimate wordcel makes for a poor shape rotator.
All ARC-AGI tasks are "spatial reasoning" tasks. They aren't in any way special. They just force LLMs to perform in an area they're spectacularly weak at. And LLMs aren't good enough yet to be able to brute force through this innate deficiency with raw intelligence.
fumeux_fume
24 minutes ago
For many people, the difference between how a language model solves a problem and how a human solves a problem is actually very important.
HighGoldstein
5 hours ago
> There is very little practical difference between "understanding" and "reasoning" as implemented in the human mind and as implemented in LLMs.
Source?
ACCount37
4 hours ago
The primary source is: measured LLM performance on once-human-exclusive tasks - such as high end natural language processing or commonsense reasoning.
Those things were once thought to require a human mind - clearly, not anymore. Human commonsense knowledge can be both captured and applied by a learning algorithm trained on nothing but a boatload of text.
But another important source is: loads and loads of mechanistic interpretability research that has tried to actually pry the black box open and see what happens on the inside.
This found some amusing artifacts - such as latent world models that can be extracted from the hidden state, or neural circuits corresponding to high-level abstractions being chained together to obtain the final outputs. Very similar to human "abstract thinking" in function - despite being implemented on a substrate of floating point math and not wet meat.
NooneAtAll3
5 hours ago
...literally the benchmarks the post is all about?
Practical difference is about results - and the results are here.
dwallin
2 hours ago
Very much agree with this. Looking at the dimensionality of a given problem space is a very helpful heuristic when analyzing how likely an LLM is to be suitable/reliable for that task. Consider how important positional encodings are to LLM performance. You also then have an attention model that operates in that 1-dimensional space. With multidimensional data, significant transformations to encode it into a higher-dimensional abstraction need to happen within the model itself before the model can even attempt to intelligently manipulate it.
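(A toy illustration of the point, with made-up data rather than anything from the article: once a 2D grid is serialized row by row into a 1D token stream - which is how a text-only model receives it - cells that are vertical neighbours end up a full row-width apart in the sequence, and the model has to recover that structure internally.)

    // Toy example: flattening a 2D grid into the 1D sequence a text model sees.
    package main

    import "fmt"

    // flatten serializes the grid row by row, the way it would appear as text.
    func flatten(grid [][]int) []int {
        var seq []int
        for _, row := range grid {
            seq = append(seq, row...)
        }
        return seq
    }

    func main() {
        grid := [][]int{
            {1, 2, 3},
            {4, 5, 6},
            {7, 8, 9},
        }
        fmt.Println(flatten(grid)) // [1 2 3 4 5 6 7 8 9]

        // In 2D, cell (0,0)=1 and cell (1,0)=4 are direct neighbours.
        // In the flattened sequence they sit width=3 positions apart,
        // and that distance grows with the width of the grid.
        fmt.Println("index distance between vertical neighbours:", len(grid[0]))
    }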
pessimizer
2 hours ago
> Pattern matching questions on a contrived test is not the same thing as understanding or reasoning.
Pattern matching is definitely the same thing as understanding and reasoning.
The problem is that LLMs can't recognize patterns that are longer than a few paragraphs, because the tokens would have to be far too long. LLMs are a thing we are lucky to have because we have very fast computers and very smart mathematicians making very hard calculations very efficient and parallelizable. But they sit on top of a bed of an enormous amount of human written knowledge, and can only stretch so far from that bed before completely falling apart.
Humans don't use tokenizers.
The goal right now is to build a scaffolding of these dummies in order to get really complicated work done, but that work is only ever going to be correct by accident, because of the accumulation of errors. This may be enough for a lot of tasks if we try it 1000x and run manually-tuned algos over the output to find the good ones. But this is essentially manual work, done in the traditional way.
edit: sorry, you're never going to convince me these things are geniuses when I chat to them for a couple of back and forth exchanges and they're already obviously losing track of everything, even what they just said. The good thing is that what they are is enough to do a lot, if you're a person who can be satisfied that they're not going to be your god anytime soon.