HarHarVeryFunny
7 days ago
An LLM necessarily has to create some sort of internal "model" / representations pursuant to its "predict next word" training goal, given the depth and sophistication of context recognition needed to do well. This isn't an N-gram model restricted to just looking at surface word sequences.
However, the question should be: what sort of internal "model" has it built? It seems fashionable to refer to this as a "world model", but IMO that isn't really appropriate, and it's certainly going to be quite different from the predictive representations built by any animal that interacts with the world and learns from those interactions.
The thing is that an LLM is an auto-regressive model - it is trying to predict continuations of training set samples solely based on word sequences, and is not privy to the world that is actually being described by those word sequences. It can't model the generative process of the humans who created those training set samples because that generative process has different inputs - sensory ones (in addition to auto-regressive ones).
The "world model" of a human, or any other animal, is built pursuant to predicting the environment, but not in a purely passive way (such as a multi-modal LLM predicting next frame in a video). The animal is primarily concerned with predicting the outcomes of it's interactions with the environment, driven by the evolutionary pressure to learn to act in way that maximizes survival and proliferation of its DNA. This is the nature of a real "world model" - it's modelling the world (as perceived thru sensory inputs) as a dynamical process reacting to the actions of the animal. This is very different to the passive "context patterns" learnt by an LLM that are merely predicting auto-regressive continuations (whether just words, or multi-modal video frames/etc).
mistercow
7 days ago
> It can't model the generative process of the humans who created those training set samples because that generative process has different inputs - sensory ones (in addition to auto-regressive ones).
I think that’s too strong a statement. I would say that it’s very constrained in its ability to model that, but not having access to the same inputs doesn’t mean you can’t model a process.
For example, we model hurricanes based on measurements taken from satellites. Those aren’t the actual inputs to the hurricane itself, but abstracted correlates of those inputs. An LLM does have access to correlates of the inputs to human writing, i.e. textual descriptions of sensory inputs.
HarHarVeryFunny
7 days ago
You can model a generative process, but it's necessarily an auto-regressive generative process - not the same as the originating generative process, which is grounded in the external world.
Human language, and other actions, exist on a spectrum from the almost purely auto-regressive (generating a stock/practiced phrase such as "have a nice day") to the highly interactive. An auto-regressive model is obviously going to have more success modelling an auto-regressive generative process.
Weather prediction is really a good illustration of the limitations of auto-regressive models, as well as of models that don't accurately reflect the inputs to the process you are attempting to predict. "There's a low pressure front coming in, so the weather will be X, same as last time" works some of the time. A crude physical weather model based on limited data points, such as weather balloon readings or satellite observation of hurricanes, also works some of the time. But of course these models are sometimes hopelessly wrong too.
My real point wasn't about the lack of sensory data, even though this does force a purely auto-regressive (i.e. wrong) model, but rather about the difference between a passive model (such as weather prediction), and an interactive one.
nerdponx
6 days ago
The whole innovation of GPT and LLMs in general is that an autoregressive model can make alarmingly good next-token predictions with the right inductive bias, a large number of parameters, a long context window, and a huge training set.
It turns out that human communication is quite a lot more "autoregressive" than people assumed it was up until now. And that includes some level of reasoning capability, arising out of a kind of brute force pattern matching. It has limits, of course, but it's amazing that it works as well as it does.
HarHarVeryFunny
4 days ago
It is amazing, and interesting.
Although I used the word myself, I'm not sure that "autoregressive" is quite the right word to describe how LLMs work, or how our brains do. Maybe it's better to just call both "predictive". In both cases the predictive inputs include the sequence itself (or selected parts of it, at varying depths of representation), but also global knowledge, both factual and procedural (HOW to represent the sequence). In the case of our brain there are also many more inputs that may be used, such as sensory ones (passive observations or action feedback), emotional state, etc.
Regardless of what predictive inputs are available to LLMs vs brains, it does seem that in a lot of cases the more constrained inputs of an LLM don't prevent it from sounding very human-like (not surprising at some level, given the training goal), and an LLM chat window does create a "level playing field" (i.e. an impoverished input setting for the human) where each side only sees the other as a stream of text. Maybe in this setting the human, when not reasoning, really isn't bringing much more predictive machinery to the table than the LLM/transformer!
Notwithstanding the predictive nature of LLMs, I can't help but also see them just as expert systems of sorts, albeit ones that have derived their own rules (many pertaining to language) rather than being given them. This view better matches their nature as fixed repositories of knowledge, brittle where rules are missing, as opposed to something more brain-like and intelligent, capable of continual learning.
shanusmagnus
7 days ago
Brilliant analogy.
And we can imagine that, in a sci-fi world where some super-being could act on a scale that would allow it to perturb the world in a fashion amenable to causing hurricanes, the hurricane model could be substantially augmented, for the same reason that motor babbling in an infant leads to fluid motion in the child.
What has been a revelation to me is how, even peering through this dark glass, titanic amounts of data allow quite useful world models to emerge, even if they're super limited -- a type of "bitter lesson" that suggests we're only at the beginning of what's possible.
I expect robotics + LLM to drive the next big breakthroughs, perhaps w/ virtual worlds [1] as an intermediate step.
slashdave
7 days ago
Indeed. If you provided a talented individual with a sufficient quantity and variety of video streams of travels in a city (like New York), that person would be able to draw you a map.
madaxe_again
7 days ago
You say this, yet cases such as Helen Keller’s suggest that a full sensorium is not necessary to be a full human. She had some grasp of the idea of colour, and of sound, and could use the words around them appropriately - yet had no firsthand experience of either. Is it really so different?
I think “we” each comprise a number of models, language being just one of them - however an extremely powerful one, as it allows the transmission of thought across time and space. It’s therefore understandable that much of what we recognise as conscious thought, of a model of the world, emerges from such an information dense system. It’s literally developed to describe the world, efficiently and completely, and so that symbol map an LLM carries possibly isn’t that different to our own.
HarHarVeryFunny
7 days ago
It's not about the necessity of specific sensory inputs, but rather about the difference in the type of model that will be built when the goal is passive and auto-regressive, as opposed to when it is interactive.
In the passive/auto-regressive case you just need to model predictive contexts.
In the interactive case you need to model dynamical behaviors.
madaxe_again
6 days ago
I don’t know that I see the difference - but I suppose we’re getting into Brains In Vats territory. In my view (well, Baudrillard’s view, but who’s counting?) a perfect description of a thing is as good as the thing itself, and we in fact interact with our semantic description of reality rather than with raw reality itself - the latter, when it manifests in humans, results in vast cognitive dysfunction. Sacks wrote somewhat on the topic of unfiltered sensorium and its impact on the ability to operate in the world.
So yeah. I think what these models do and what we do is more similar than we might realise.
comfysocks
5 days ago
It seems to me that the human authors of the training text are the ones who have created the “world model”, and have encoded it into written language. The LLM transcodes this model into a word-embedding vector space. I think most people can recognize a high dimensional vector space as a reasonable foundation for a mathematical “model”. The humans are the ones who have interacted with the world and have perceived its workings. The LLM only interacts with the humans’ language model. Some credit must be given to the human modellers for the unreasonable effectiveness of the LLM.
machiaweliczny
6 days ago
But if you squint then sensory actions and reactions are also sequential tokens. Even reactions can be encoded alongside the input, as action tokens in a single token stream. Has anyone tried something like this?
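Roughly, I'm imagining a toy interleaving like this (all the token names are made up, purely for illustration), so an ordinary next-token model sees observations and the agent's own actions in one stream:

    # Toy sketch: interleave observation tokens and action tokens into one
    # sequence that a standard autoregressive transformer could be trained on.
    OBS, ACT = "<obs>", "<act>"

    episode = [
        (["patch_17", "patch_42"], "move_left"),
        (["patch_18", "patch_40"], "grasp"),
    ]

    stream = []
    for obs_tokens, action in episode:
        stream += [OBS] + obs_tokens + [ACT, action]

    print(stream)
    # ['<obs>', 'patch_17', 'patch_42', '<act>', 'move_left',
    #  '<obs>', 'patch_18', 'patch_40', '<act>', 'grasp']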
RaftPeople
5 days ago
> But if you squint then sensory actions and reactions are also sequential tokens
I'm not sure you could model it that way.
Animal brains don't necessarily just react to sensory input: they frequently have already predicted the next state based on previous state and learning/experience, and not just in a simple sequential manner but at many different levels of patterns simultaneously (local immediate actions vs. actions that are part of a larger structure of behavior), etc.
Sensory input is compared to predicted state and differences are incorporated into the flow.
The key thing is that our brains are modeling and simulating the world around us and its future state (modeling the physical world as well as the abstract world of what other animals are thinking). It's not clear that LLMs are doing that (my assumption is that they are not doing any of that, and until we build systems that do, we won't be moving towards the kind of flexible and adaptable control our brains have).
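As a cartoon of that compare-and-correct loop (purely schematic, not a claim about how neurons actually implement it):

    import numpy as np

    # Schematic predict -> compare -> incorporate-the-difference loop.
    # All weights and inputs are random stand-ins, just to show the shape of
    # the idea: only the prediction error updates the internal state.
    rng = np.random.default_rng(0)
    state = np.zeros(8)                        # internal model of the world
    W = rng.normal(scale=0.1, size=(8, 8))     # made-up predictive weights
    lr = 0.05

    for t in range(100):
        predicted = W @ state                  # what the brain expects to sense
        sensed = rng.normal(size=8)            # stand-in for actual sensory input
        error = sensed - predicted             # the surprise
        state += lr * error                    # only differences flow into the model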
Edit: I just read the rest of the parent post that said basically the same thing, was skimming so missed it.
dsubburam
6 days ago
> The "world model" of a human, or any other animal, is built pursuant to predicting the environment
What do you make of Immanuel Kant's claim that all thinking has as a basis the presumption of the "Categories"--fundamental concepts like quantity, quality and causality[1]. Do LLMs need to develop a deep understanding of these?
westurner
6 days ago
Embodied cognition implies that we understand our world in terms of embodied metaphor "categories".
LLMs don't reason, they emulate. RLHF could cause an LLM to discard text that doesn't look like reasoning according to the words in the response, but that's still not reasoning or inference.
"LLMs cannot find reasoning errors, but can correct them" https://news.ycombinator.com/item?id=38353285
Conceptual metaphor: https://en.wikipedia.org/wiki/Conceptual_metaphor
Embodied cognition: https://en.wikipedia.org/wiki/Embodied_cognition
Clean language: https://en.wikipedia.org/wiki/Clean_language
Given human embodied cognition as the basis for LLM training data, there are bound to be weird outputs about bodies from robot LLMs.
lxgr
7 days ago
But isn't the distinction between a "passive" and an "active" model ultimately a metaphysical (freedom of will vs. determinism) question, under the (possibly practically infeasible) assumption that the passive model gets to witness all possible actions an agent might take?
Practically, I could definitely imagine interesting outcomes from e.g. hooking up a model to a high-fidelity physics simulator during training.
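A minimal version of that loop, using a toy Gymnasium environment as the "simulator" and a random policy standing in for the model, might look like:

    import gymnasium as gym

    # Sketch: the model (here a random stand-in) acts, the simulator answers,
    # and the resulting action-conditioned transitions become training data -
    # something a purely passive corpus can't provide.
    env = gym.make("CartPole-v1")
    obs, info = env.reset(seed=0)

    trajectory = []
    for t in range(200):
        action = env.action_space.sample()     # stand-in for the model's choice
        next_obs, reward, terminated, truncated, info = env.step(action)
        trajectory.append((obs, action, next_obs))
        obs = next_obs
        if terminated or truncated:
            obs, info = env.reset()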
stonemetal12
7 days ago
People around here like to say "The map isn't the territory". If we are talking about the physical world, then language is a map, not the territory - and not a detailed one either; an LLM trained on it is a second-order map.
If we consider the territory to be human intelligence, then language is still a map, but a much more detailed one. Thus an LLM trained on it becomes a more interesting second-order map.
seydor
7 days ago
Animals could well use an autoregressive model to predict the outcomes of their actions on their perceptions. It's not like we run the math in our everyday actions (it would take too long).
Perhaps that's why we can easily communicate those predictions as words.
ElevenLathe
5 days ago
We can't see neutrons either, but we have built various models of them based on indirect observations.