anon373839
6 days ago
I remain skeptical of emergent properties in LLMs in the way people have used that term. There was a belief 3-4 years ago that if you just made the models big enough, they would magically acquire intelligence. But since then, we've seen that the models are actually still pretty limited by the training data: like other ML models, they interpolate well between the data they've been trained on, but they don't generalize well beyond it. Also, we have since seen models that are 50-100x smaller exhibit the same "emergent" capabilities that were once thought to require hundreds of billions of parameters. I personally think the emergent properties really belong to the data instead.
andy99
6 days ago
Yes, deep learning models only interpolate; essentially, they're an effective way of storing data-labeling effort. That doesn't mean they're not useful, just not what tech-adjacent promoters want people to think.
john-h-k
6 days ago
> Yes, deep learning models only interpolate
What do you mean by this? I don’t think the understanding of LLMs is sufficient to make this claim
andy99
6 days ago
An LLM is a classifier. There's a lot of research into how deep learning classifiers work, and I haven't seen it contradicted when applied to LLMs.
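To be concrete about what I mean by "classifier": at each step the model scores every vocabulary entry and picks one, the same as any other softmax classifier. A toy sketch (made-up sizes and random weights, not any real model):

```python
# Toy sketch of the "classifier" view of next-token prediction: score every
# vocabulary entry, softmax, pick a class. Sizes and weights are made up.
import numpy as np

vocab_size, hidden_dim = 1_000, 64                 # toy sizes, not a real model
rng = np.random.default_rng(0)

hidden_state = rng.normal(size=hidden_dim)         # stand-in for the transformer's output
W_unembed = rng.normal(size=(vocab_size, hidden_dim)) * 0.1  # stand-in unembedding matrix

logits = W_unembed @ hidden_state                  # one score per vocabulary "class"
probs = np.exp(logits - logits.max())
probs /= probs.sum()                               # softmax over the vocabulary

next_token_id = int(probs.argmax())                # greedy decoding = picking the top class
print(next_token_id, float(probs[next_token_id]))
```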
drdeca
6 days ago
I still think it's unclear what you mean by "interpolate" in this context. If your NN takes in several numbers and assigns logits to each class based on those numbers, then, considering the n-dimensional space of possible inputs, the meaning of "interpolate" is fairly clear when the new input is in the convex hull of the inputs that appear in training samples.
But when the inputs are sequences of tokens…
Granted, each token gets embedded as some vector, and you can concatenate those vectors to represent the sequence of tokens as one big vector, but, are these vectors for novel strings in the convex hull of such vectors for the strings in the training set?
bunderbunder
6 days ago
The answer is kind of right there in the start of your last sentence. From the transformer model's perspective, the input is just a time series of vectors. It ultimately isn't any different from any other time series of vectors.
Way back in the day when I was working with latent Dirichlet allocation models, I had a minor enlightenment moment when I realized that the models really weren't capturing any semantically meaningful relationships. They were only capturing meaningless statistical correlations to which I would then assign semantic value so effortlessly and automatically that I didn't even realize it was always me doing it, never the model.
I'm pretty sure LLMs exist on that same continuum. And if you travel down it in the other direction, you get to simple truisms such as "correlation does not equal causation."
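If it helps, the LDA point is easy to reproduce with a toy example (made-up corpus and parameters, scikit-learn used purely for illustration): the model only surfaces co-occurrence structure, and "sports topic" vs. "legal topic" is a label we supply when we read the top words.

```python
# Toy reproduction of the LDA point: the model only groups words by
# co-occurrence; the topic names are ours. Corpus and parameters are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the goalkeeper saved the penalty in the final match",
    "the striker scored twice in the league match",
    "the court ruled on the appeal in the fraud case",
    "the judge sentenced the defendant after the trial",
]

vectorizer = CountVectorizer(stop_words="english").fit(docs)
X = vectorizer.transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [vocab[i] for i in topic.argsort()[-5:][::-1]]
    # whatever grouping shows up, "sports" vs. "legal" is a label *we* supply
    print(f"topic {k}: {top_words}")
```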
drdeca
6 days ago
The part about “is it in the convex hull?” was an important part of the question.
It seems to me that if it isn’t in the convex hull, it could be more fitting to describe it as extrapolation, rather than interpolation?
In general, my question does apply to the task of predicting how a time series of vectors continues: Given a dataset of time series, where the dimension of each vector in the series is such and such, the length of each series is yea long, and there are N series in the training set, should we expect series in the test set or validation set to be in the convex hull of the ones in the training set?
I would think that the number of series in the training set, N, while large, might not be all that large compared to the dimensionality of a whole series?
Hm, are there efficient techniques for evaluating whether a high dimensional vector is in the convex hull of a large number of other high dimensional vectors?
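The most direct formulation I can think of is a linear-programming feasibility check, x = sum_i w_i p_i with w_i >= 0 and sum_i w_i = 1 (rough sketch below with made-up toy data), but I don't know how well that scales to the dimensions and counts involved here.

```python
# Hull-membership as LP feasibility: x is in conv{p_1..p_N} iff there are
# weights w >= 0 with sum(w) == 1 and points.T @ w == x. Toy data only.
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(x, points):
    """True if x lies in the convex hull of the rows of `points`."""
    n = points.shape[0]
    A_eq = np.vstack([points.T, np.ones((1, n))])   # points.T @ w = x, sum(w) = 1
    b_eq = np.concatenate([x, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return res.success                              # feasible <=> inside the hull

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 50))                       # 1000 "training" vectors in 50 dims
print(in_convex_hull(train.mean(axis=0), train))          # centroid: True
print(in_convex_hull(10 * rng.normal(size=50), train))    # far-away point: False
```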
bunderbunder
6 days ago
Just shooting from the hip, LLMs operate out on a frontier where the curse of dimensionality removes a large chunk of the practical value from the concept of a convex hull. Especially in a case like this where the vector embedding process places hard limits on the range of possible magnitudes and directions for any single vector.
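A quick way to see what I mean (a Monte Carlo sketch with toy Gaussian data and made-up sizes, not real embeddings): sample N "training" points and fresh points from the same distribution, and count how often a fresh point actually lands inside the training hull as the dimension grows.

```python
# Monte Carlo sketch: even when test points come from the SAME distribution as
# the training points, they almost never fall inside the training convex hull
# once the dimension is high. Toy Gaussian data, made-up sizes.
import numpy as np
from scipy.optimize import linprog

def in_hull(x, points):
    n = points.shape[0]
    A_eq = np.vstack([points.T, np.ones((1, n))])
    b_eq = np.concatenate([x, [1.0]])
    return linprog(np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n).success

rng = np.random.default_rng(0)
N, trials = 500, 30
for d in (2, 10, 50, 200):
    train = rng.normal(size=(N, d))
    hits = sum(in_hull(rng.normal(size=d), train) for _ in range(trials))
    print(f"d={d:4d}: {hits}/{trials} fresh points inside the hull")
    # expect nearly all inside at d=2 and next to none as d grows toward N
```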
drdeca
4 days ago
Outside of the context of a convex hull, I don’t know how to make a distinction between interpolation and extrapolation. This is the core of my question.
What precisely is it that you mean when you say that it is interpolating rather than extrapolating? In the only definition that I know, the one based on convex hulls, I believe it would be extrapolating rather than interpolating. But people often say it is interpolating rather than extrapolating, and I don’t know what they mean.
bunderbunder
3 days ago
I doubt they're really thinking about it in a mathematical sense when they say that. I'm guessing, for example, that "extrapolate" is meant in the more colloquial sense, which is maybe closer to "deduce" in practice.
kevinsync
6 days ago
My hot take is that what some people are labeling as "emergent" is actually just "incidental encoding" or "implicit signal" -- latent properties that get embedded just by nature of what's being looked at.
For instance, if you have a massive tome of English text, a rather high percentage of it will be grammatically correct (or close), syntactically well-formed, and understandable, because humans who speak good English took the time to write it and wrote it how other humans would expect to read or hear it. This, by its very nature, embeds "English language" knowledge through sequence, word choice, normally-hard-to-quantify expressions (colloquial or otherwise), etc.
When you consider source data from many modes, there's all kinds of implicit stuff that gets incidentally written... for instance, real photographs of outer space or the deep sea would only show humans in protective gear, not swimming next to the Titanic. Conversely, you won't see polar bears eating at Chipotle, or giant humans standing on top of mountains.
There's a "this showed up enough in the training data to loosely confirm its existence" versus "can't say I ever saw that, so let's just synthesize it" aspect to the embeddings that one person could interpret as "emergent intelligence", while another could just as convincingly say it's probabilistic output mostly in line with what we expect to receive. Train the LLM on absolute nonsense instead and you'll receive exactly that back.
atoav
6 days ago
"Emergent," as I have known and used the term, means that more complex behavior emerges from simple rules.
My go-to example for this is the Game of Life, where a very organically behaving (Turing-complete) system emerges from very simple rules. Now, the Game of Life is a deterministic system, meaning that the same rules and the same start configuration will play out in exactly the same way each time; but given the simplicity of the logic and the rules, the resulting complexity is what I'd call emergent.
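For reference, the full rule set fits in a few lines (a minimal sketch using a wrap-around grid via np.roll; the glider start position is just an example):

```python
# Minimal Game of Life step (B3/S23) on a wrap-around grid: a cell is born
# with exactly 3 live neighbours and survives with 2 or 3. That's the whole rule.
import numpy as np

def step(grid: np.ndarray) -> np.ndarray:
    # count the 8 neighbours of every cell by summing shifted copies of the grid
    neighbours = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    return ((neighbours == 3) | (grid & (neighbours == 2))).astype(np.uint8)

# a glider on a 10x10 torus: five live cells that keep "walking" forever
grid = np.zeros((10, 10), dtype=np.uint8)
for y, x in [(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)]:
    grid[y, x] = 1
for _ in range(4):
    grid = step(grid)
print(grid.sum())   # still 5 live cells, just shifted one step diagonally
```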
So maybe this is more about the definition of what we'd call emergent and what not.
As someone who has programmed Markov chains, where the stochastic interpolation really shines through, I'd say transformer-based LLMs definitely show some emergent behavior one wouldn't have immediately suspected just from the rules. Emergent does not mean "conscious" or "self-reflective" or anything like that. But the things an LLM can infer from its training data are already quite impressive.
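For contrast, the kind of Markov chain I mean fits in a dozen lines (toy corpus, not my actual code); the "interpolation" there is literal, since the chain can only emit a word it has already seen following the current one:

```python
# Rough sketch of a word-level Markov chain: the "stochastic interpolation" is
# right there in the table lookup, since the chain can only emit words it has
# actually seen following the current word. Toy corpus, not my real code.
import random
from collections import defaultdict

corpus = "the cat sat on the mat and the dog sat on the rug".split()

table = defaultdict(list)                  # word -> words observed to follow it
for current, nxt in zip(corpus, corpus[1:]):
    table[current].append(nxt)

def generate(start: str, length: int = 10, seed: int = 0) -> str:
    rng = random.Random(seed)
    word, out = start, [start]
    for _ in range(length):
        if word not in table:              # dead end: no observed successor
            break
        word = rng.choice(table[word])     # sample among observed successors
        out.append(word)
    return " ".join(out)

print(generate("the"))
```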
gond
6 days ago
Interesting. Is there a quantitative threshold for emergence that anyone could point to with these smaller models? Tracing the thoughts of a large language model is probably the only way to be sure, or is it?
gond
6 days ago
Disregarding the downvotes, I mean this as a serious question.
From the linked article: “We don’t know an “algorithm” for this, and we can’t even begin to guess the required parameter budget or the training data needed.”
Why not, at least for the external factors? The computational resources and the size of the training dataset are quantifiable from an input point of view. What gets used is not, but the input size should be.