jebarker
8 hours ago
> Do they merely memorize training data and reread it out loud, or are they picking up the rules of English grammar and the syntax of C language?
This is a false dichotomy; functionally, the reality is in the middle. They "memorize" training data in the sense that the loss is minimized on those points, but at test time they are asked to interpolate (and extrapolate) to new points. How well they generalize depends on how well interpolation between training points works. If it works reliably, then you could say the interpolation is a good approximation of some grammar rule, say. It's all about the data.
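Here's a toy version of what I mean (everything about the setup -- the sine target, the polynomial degree, the ranges -- is invented purely for illustration, not a claim about what LLMs do internally): the fit only drives the loss down on the training points, and whether that counts as "learning the rule" is just a question of how well interpolation between those points works versus extrapolation beyond them.

    # Minimal sketch of "memorise vs. interpolate" on a toy 1-D problem.
    import numpy as np

    rng = np.random.default_rng(0)
    x_train = rng.uniform(-1.0, 1.0, size=50)
    y_train = np.sin(3 * x_train) + 0.05 * rng.normal(size=50)

    # Fitting only minimises loss on exactly these points.
    coefs = np.polyfit(x_train, y_train, deg=9)

    x_interp = np.linspace(-1.0, 1.0, 200)   # between training points
    x_extrap = np.linspace(1.0, 2.0, 200)    # outside the training range

    err_interp = np.mean((np.polyval(coefs, x_interp) - np.sin(3 * x_interp)) ** 2)
    err_extrap = np.mean((np.polyval(coefs, x_extrap) - np.sin(3 * x_extrap)) ** 2)
    print(f"interpolation MSE: {err_interp:.4f}, extrapolation MSE: {err_extrap:.4f}")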
mjburgess
7 hours ago
This only applies to intra-distribution "generalisation", which is not the meaning of the term we've come to associate with science. There, generalisation means across all environments (i.e., something generalises if it's valid and reliable, where valid = it measures the property it claims to, and reliable = it stays valid under causal perturbations of the environment).
Since an LLM does not change in response to changes in the meaning of terms (e.g., consider how the meaning of "the war in Ukraine" has shifted over the last 10 years), it isn't reliable in the scientific sense. Explaining why it isn't valid would take much longer, but it's not valid either.
In any case: the notion of 'generalisation' used in ML just means that we assume there is a single stationary distribution of words, and we want to sample from that distribution without being biased towards oversampling points identical to the training data.
Not only is this assumption false (there is no stationary distribution), it is also irrelevant to generalisation in the traditional sense, since whether or not we are biased towards the training data isn't what we're interested in. We want the output to be valid (the system uses words to mean what they mean) and reliable (it does so across all environments in which they mean something).
Neither property follows from, nor is even related to, the ML sense of generalisation. Indeed, if LLMs generalised only in that sense, they would be very bad at usefully generalising, since the assumptions behind it are false.
jebarker
6 hours ago
I don't really follow what you're saying here. I understand that the use of language in the real world is not sampled from a stationary distribution, but it also seems plausible that you could relax that assumption in an LLM, e.g. by conditioning the distribution on time, and then intra-distribution generalization would still make sense as a way to study how well the LLM works on held-out test samples.
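Something like this toy sketch is what I have in mind (made-up data, an artificial meaning drift, and scikit-learn assumed; the hand-crafted time feature just stands in for whatever conditioning mechanism you'd actually use). Held-out evaluation within p(y | x, t) is still a perfectly well-posed question:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 2000
    t = rng.uniform(2014, 2024, size=n)      # "year" of each utterance
    x = rng.normal(size=(n, 3))              # fixed surface features
    # The meaning of the label drifts with time: the boundary moves.
    logits = x[:, 0] + (t - 2019) * x[:, 1]
    y = (logits + 0.3 * rng.normal(size=n) > 0).astype(int)

    X_plain = x
    X_time = np.column_stack([x, t - 2019, (t - 2019) * x[:, 1]])

    for name, X in [("ignores time", X_plain), ("conditioned on time", X_time)]:
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
        print(f"{name}: held-out accuracy {acc:.2f}")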
Intra-distribution generalization seems like the only rigorously defined kind of generalization we have. Can you provide any references that describe this other kind of generalization? I'd love to learn more.
ericjang
6 hours ago
Intra-distribution generalization is also not well posed in practical real-world settings. Suppose you learn a mapping f : x -> y. Casually speaking, intra-distribution generalization implies that f generalizes for "points from the same data distribution p(x)". Two issues here:
1. In practical scenarios, how do you know if x' is really drawn from p(x)? Even if you could compute log p(x') under the true data distribution, you can only verify that x' has non-zero support. One sample is not enough to tell you whether x' was drawn from p(x) (see the sketch after this list).
2. In high-dimensional settings, an x' that is not exactly equal to an example in the training set can have arbitrarily high generalization error. Here's a criminally under-cited paper discussing this: https://arxiv.org/abs/1801.02774
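A tiny sketch of point 1 (the two Gaussians are chosen arbitrarily, just to make the point):

    import numpy as np
    from scipy.stats import norm

    p = norm(loc=0.0, scale=1.0)      # "true" data distribution p(x)
    q = norm(loc=0.0, scale=3.0)      # a different distribution with the same support

    x_prime = q.rvs(random_state=0)   # actually drawn from q, not p
    print("log p(x') =", p.logpdf(x_prime))   # finite, i.e. non-zero support
    # Nothing in that single number distinguishes "x' ~ p" from "x' ~ q";
    # you'd need many samples (or extra assumptions) to test that.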
mjburgess
16 minutes ago
Worse even than this: there are no distributions.
What we mean by x ~ p(x), y ~ p(y|x) is not a causal relation x -> y s.t. y = f(x).
Reality itself has no probability distributions. Reality follows a causal model, where a causal relation is given in terms of necessity and possibility.
E.g., there is no such thing as Photo ~ P(Photo|PhotoOfCat) to be learned, only (All Causes) -> PhotoOfCat. Thus the setup of ML as y = f(x) is incorrect: there is no `f` which satisfies this formula (in almost all cases).
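A toy sketch of what I mean, with every name and mechanism invented purely for illustration: the "photo" below is a deterministic function of all its causes, and it only looks like a draw from P(Photo|PhotoOfCat) because we marginalise out the causes we don't observe.

    import numpy as np

    rng = np.random.default_rng(0)

    def photo(cat_present, lighting, sensor_seed):
        # Fully determined by its causes: same causes, same photo. No P involved.
        return hash((cat_present, round(lighting, 3), sensor_seed)) % 256

    # We observe only "cat_present"; lighting and the sensor state are hidden causes.
    samples = [photo(True, rng.uniform(0, 1), int(rng.integers(1_000))) for _ in range(5)]
    print(samples)   # looks "random given cat_present", yet each call was deterministic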
Consider the LLM case: reality has no P("The War in Ukraine" | TheWarIn2022) -- either the speaker meant TheWarIn2022, or they didn't. There's no sense in which reality has it that the utterance is intrinsically ambiguous (necessarily, for communication to be possible, pragmatics + semantics has to be able to fully resolve meaning).
So what are LLMs learning? Just an implied empirical distribution, "smoothed over" the data just enough that it "hangs on to it without repeating it". And this is vital: if an LLM were to try to generalise in the scientific sense, it would cease to be meaningful, since no algorithm which computes P(y|x) in this manner could capture the necessary relata that fully resolve meaning. Any system capable of modelling meaning would be probabilistic only in the sense of having a prior over such causal models: P("TheWarInUkraine" | TheWarIn2022, CausalModel) = 1, but P(CausalModel) < 1.
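To make that last formula concrete (the candidate models and their prior weights are made-up numbers, nothing more):

    # Uncertainty lives in the prior over causal models, not in the mapping
    # from a fixed model to the utterance's meaning.
    models = {
        "speaker_means_TheWarIn2022": 0.9,   # P(CausalModel)
        "speaker_means_TheWarIn2014": 0.1,
    }

    def p_utterance_given(event, model):
        # Within a fixed causal model the meaning is fully resolved: 0 or 1.
        return 1.0 if model == f"speaker_means_{event}" else 0.0

    p_utt = sum(p_utterance_given("TheWarIn2022", m) * prior for m, prior in models.items())
    print(p_utt)   # 0.9 -- the only probability here is over which model is true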
So it's always undefined what it means to "generalise" with respect to an empirical distribution -- there aren't any.
When we say scientific theories generalise, we mean their posited necessary causal relations are maintained across irrelevant interventions. E.g., Newton's theory of gravity generalises in that each term (F, M, m, r) is a valid measure of some property, and it remains a valid measure across a very large number of environments.
It fails to generalise for extreme values of M, m, etc.
In the ML sense, all intra-distributional generalisation fails for trivial perturbations of any causal property, e.g., m + dm, because this induces an entirely new distribution. The "generalisation error" depends on what m + dm does within our model, but regardless, generalisation fails.
Scientific theories do not fail to generalise in this way: irrelevant causal interventions make no difference to the explanatory adequacy (or predictive power) of the theory.
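A sketch of that contrast, with arbitrary numbers: after the intervention m -> m + dm the two environments don't even overlap as distributions, yet the same relation F = GMm/r^2 holds exactly in both.

    import numpy as np

    G, M, r = 6.674e-11, 5.97e24, 6.37e6
    rng = np.random.default_rng(0)

    m_train = rng.uniform(1.0, 10.0, size=1000)   # the environment we sampled
    m_shift = m_train + 100.0                     # after the intervention m + dm

    F_train = G * M * m_train / r**2
    F_shift = G * M * m_shift / r**2

    # The theory's relation is the same valid measure in both environments:
    for F, m in [(F_train, m_train), (F_shift, m_shift)]:
        assert np.allclose(F * r**2 / (G * M), m)   # recover m from F exactly

    # But as *distributions*, the two environments are disjoint:
    print(m_train.max(), "<", m_shift.min())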