xnx
9 hours ago
It's a curse and a blessing that discussion of topics happens in so many different places. I found this comment on Twitter/X interesting: https://x.com/fchollet/status/1841902521717293273
"Interesting work on reviving RNNs. https://arxiv.org/abs/2410.01201 -- in general the fact that there are many recent architectures coming from different directions that roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning)
Curve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape. As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime."
drodgers
3 hours ago
> The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape
I have almost the opposite take. We've had a lot of datasets for ages, but all the progress in the last decade has come from advances in how curves are architected and fit to the dataset (including applying more computing power).
Maybe there's some theoretical sense in which older models could have solved newer problems just as well if only we applied 1000000x the computing power, so the new models are 'just' an optimisation, but that's like dismissing the importance of complexity analysis in algorithm design, and thus insisting that bogosort and quicksort are equivalent.
When you start layering in normalisation techniques to minimise overfitting, and especially once you start thinking about more agentic architectures (eg. Deep Q Learning, some of the search space design going into OpenAI's o1), then I don't think the just-an-optimisation perspective can hold much water at all - more computing power simply couldn't solve those problems with older architectures.
eru
an hour ago
I see what you are saying, and I made a similar comment.
However it's still an interesting observation that many architectures can arrive at the same performance (even though the training requirements are different).
Naively, you wouldn't expect eg 'x -> a * x + b' and 'x -> a * sin x + b' to fit the same data about equally well. But that's an observation from low dimensions. It seems once you add enough parameters, the exact model doesn't matter too much for practical expressiveness.
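For concreteness, a minimal numpy sketch of that low-dimensional picture (the data and the two forms are my own illustrative choices, nothing from the paper). Both forms are linear in (a, b), so least squares fits each one optimally:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 200)
    y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.shape)  # data generated by a line

    for basis, name in [(x, "a*x + b"), (np.sin(x), "a*sin(x) + b")]:
        A = np.column_stack([basis, np.ones_like(x)])
        (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
        mse = np.mean((A @ np.array([a, b]) - y) ** 2)
        print(f"{name}: a={a:.2f} b={b:.2f} mse={mse:.2f}")

The line fits down to the noise floor; the sine can do little better than match the mean. Two parameters each, very different reachable curves.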
I'm faintly reminded of the Church-Turing Thesis; the differences between different computing architectures are both 'real' but also 'just an optimisation'.
> When you start layering in normalisation techniques to minimise overfitting, and especially once you start thinking about more agentic architectures (eg. Deep Q Learning, some of the search space design going into OpenAI's o1), then I don't think the just-an-optimisation perspective can hold much water at all - more computing power simply couldn't solve those problems with older architectures.
You are right: these normalisation techniques help you economise on training data, not just on compute. Some of these techniques can be done independently of the model, eg augmenting your training data with noise. But some others are very model dependent.
I'm not sure how the 'agentic' approaches fit here.
refulgentis
an hour ago
> Naively, you wouldn't expect
I, a naïf, expected this.
Is multiplication versus sine in the analogy hiding it, perhaps?
I've always pictured it as just "needing to learn" the function terms, with the function's guts being an abstraction that is learned.
Might just be because I'm a physics dropout with a bunch of whacky half-remembered probably-wrong stuff about how any function can be approximated by e.g. Fourier series.
eru
10 minutes ago
So (most) neural nets can be seen as a function of a _fixed_ form with some inputs and lots and lots of parameters.
In my example, a and b were the parameters. The kinds of data you can approximate well with a simple sine wave and the kinds of data you can approximate with a straight line are rather different.
Training your neural net only fiddles with the parameters like a and b. It doesn't do anything about the shape of the function. It doesn't change sine into multiplication etc.
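A tiny hand-rolled sketch of that point (toy data; the sine form, learning rate, and iteration count are arbitrary choices). Gradient descent nudges a and b, and nothing else:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 100)
    y = 3.0 * np.sin(x) + 1.0 + rng.normal(0, 0.1, x.shape)

    a, b, lr = 0.0, 0.0, 0.01
    for _ in range(5000):
        err = a * np.sin(x) + b - y             # residuals under current parameters
        a -= lr * np.mean(2 * err * np.sin(x))  # d(mse)/da
        b -= lr * np.mean(2 * err)              # d(mse)/db
    print(a, b)  # converges to roughly 3 and 1

The shape (a sine) was fixed before training started; fitting only ever moves within that family of curves.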
> [...] about how any function can be approximated by e.g. Fourier series.
Fourier series are an interesting example to bring up! I think I see what you mean.
In theory they work well to approximate any function over either a periodic domain or some finite interval. But unless you take special care, Fourier analysis becomes extremely sensitive to errors in the phase parameters.
(Special care could eg mean hacking up your input domain into 'boxes'. That works well for eg audio or video compression, but gives up on any model generalisation between 'boxes', especially for predicting what would happen in a later box.)
Another interesting example is Taylor series. For many simple functions Taylor series are great, but for even moderately complicated ones you need to be careful. See eg how the Taylor series for the logarithm around x=1 works well, but if you try it around x=0, you are in for a bad time.
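A quick numerical illustration of the x=1 case (toy code; twenty terms is an arbitrary cutoff):

    import math

    def log_taylor_about_1(x, n_terms=20):
        # ln(x) = (x-1) - (x-1)^2/2 + (x-1)^3/3 - ...; converges only for 0 < x <= 2
        return sum((-1) ** (k + 1) * (x - 1) ** k / k for k in range(1, n_terms + 1))

    for x in [1.5, 1.9, 2.5]:
        print(x, log_taylor_about_1(x), math.log(x))

The first two agree closely with math.log; 2.5 lies outside the radius of convergence and the partial sums are already garbage. Around x=0 there is no Taylor series at all, since the logarithm blows up there.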
The interesting observation isn't just that there are multiple universal approximators, but that at high enough parameter count they seem to perform about equally well at approximating in practice (while differing in how well they can be trained).
islewis
9 hours ago
> "As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime."
I haven't fully ingested the paper yet, but it looks like it's focused more on compute optimization than the size of the dataset:
> ... and (2) are fully parallelizable during training (175x faster for a sequence of length 512)
Even if many types of architectures converge to the same loss over time, finding the one that converges the fastest is quite valuable given the cost of running GPUs at scale.
teruakohatu
8 hours ago
> Even if many types of architectures converge to the same loss over time, finding the one that converges the fastest is quite valuable given the cost of running GPUs at scale.
This! Not just fastest but with the lowest resources in total.
Fully connected neural networks are universal function approximators. Technically we don't need anything but an FNN, but memory requirements and speed would be abysmal, far beyond the realm of practicality.
actionfromafar
5 hours ago
Unless we could build chips in 3D?
foota
4 hours ago
Not even then; a truly fully connected network would have super-exponential runtime (it would take N^N time to evaluate).
ivan_gammel
4 hours ago
We need quantum computing there. I remember seeing a recent article about quantum processes in the brain. If that’s true, QC may be the missing part.
eru
an hour ago
Compare and contrast https://www.smbc-comics.com/comic/the-talk-3
(Summary: quantum computing is unlikely to help.)
ComputerGuru
2 hours ago
Heat extraction.
byearthithatius
6 hours ago
> finding the one that converges the fastest is quite valuable given the cost of running GPUs at scale
Not to him; he runs the ARC challenge. He wants a new approach entirely. Something capable of few-shot learning on out-of-distribution patterns... somehow.
sakras
6 hours ago
I figured this was pretty obvious given that MLPs are universal function approximators. A giant MLP could achieve the same results as a transformer. The problem is the scale - we can’t train a big enough MLP. Transformers are a performance optimization, and that’s why they’re useful.
wongarsu
7 hours ago
One big thing that bells and whistles do is limit the training space.
For example, when CNNs took over computer vision that wasn't because they were doing something that dense networks couldn't do. It was because they removed a lot of edges that didn't really matter, allowing us to spend our training budget on deeper networks. Similarly, transformers are great because they allow us to train gigantic networks somewhat efficiently. And this paper finds that if we make RNNs a lot faster to train, they are actually pretty good. Training speed and efficiency remain the big bottleneck, not the actual expressiveness of the architecture.
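A back-of-the-envelope sketch of just how many edges weight sharing removes (the image size and channel counts below are arbitrary illustrative choices):

    h, w, c_in, c_out, k = 224, 224, 3, 64, 3  # image height/width, channels, kernel size

    dense = (h * w * c_in) * (h * w * c_out)  # every input pixel wired to every output unit
    conv = k * k * c_in * c_out               # one shared 3x3 kernel per in/out channel pair

    print(f"dense: {dense:,}")  # 483,385,147,392 weights
    print(f"conv:  {conv:,}")   # 1,728 weights

Same inputs and outputs; the convolution just refuses to spend parameters on connections that rarely matter.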
Lerc
7 hours ago
I remember one of the initial transformer people saying in an interview that they didn't think this was the "one true architecture" but a lot of the performance came from people rallying around it and pushing in the one direction.
On the other hand, while "As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime." is true, a sufficiently expressive mechanism may not be computationally or memory efficient. As both are constraints on what you can actually build, it's not whether the architecture can produce the result, but whether a feasible/practical instantiation of that architecture can produce the result.
viktor_von
44 minutes ago
> I remember one of the initial transformer people saying in an interview that they didn't think this was the "one true architecture" but a lot of the performance came from people rallying around it and pushing in the one direction.
You may be referring to Aidan Gomez (CEO of Cohere and contributor to the transformer architecture) during his Machine Learning Street Talk podcast interview. I agree: if as much attention had been put towards the RNN during the initial transformer hype, we may very well have seen these advancements earlier.
eru
2 hours ago
Well, you also need an approach to 'curve fitting' where it's actually computationally feasible to fit the curve. The approach of mixing layers of matrix multiplication with a simple non-linearity like max(0, x) (ReLU) works really well for that. Earlier on they tried more complicated non-linearities, like sigmoids; if you tried an arbitrary curve that's not split into layers at all, you would probably find it even harder. (But I'm fairly sure in the end you might end up in the same place, just after lots more computation spent on fitting.)
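One sketch of why the simple non-linearity helps with fitting (toy numbers of my own choosing): gradients through a saturated sigmoid all but vanish, while ReLU passes them through unchanged on its active side.

    import numpy as np

    z = 10.0  # a pre-activation far from zero, where sigmoids saturate
    s = 1.0 / (1.0 + np.exp(-z))
    print("sigmoid gradient:", s * (1.0 - s))          # ~4.5e-05, vanishing
    print("relu gradient:   ", 1.0 if z > 0 else 0.0)  # exactly 1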
acchow
8 hours ago
What it will come down to is computational efficiencies. We don’t want to retrain once a month - we want to retrain continuously. We don’t want one agent talking to 5 LLMs. We want thousands of LLMs all working in concert.
ActorNightly
7 hours ago
This, and also the way models are trained has to be rethought. Backprop is good for figuring out complex function mappings, but not for storing information.
pbhjpbhj
4 hours ago
Sounds like something that has unsustainable energy costs.
ants_everywhere
7 hours ago
> is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning)
(Somewhat) fun and (somewhat) related fact: there's a whole cottage industry of "is all you need" papers https://arxiv.org/search/?query=%22is+all+you+need%22&search...
TaurenHunter
7 hours ago
Reminds me of the "Considered Harmful" articles.
bee_rider
3 hours ago
Quick, somebody write “All you need Considered Harmful” and “Considered Harmful all you need.”
Which seems closer to true?
cozzyd
2 hours ago
All you need is all you need.
jprete
6 hours ago
I wonder if there's something about tech culture - or tech people - that encourages them to really, really like snowclones.
observationist
6 hours ago
Yes. Do stuff that other people have been successful doing. Monkey see, monkey do - it's not a tech people thing, it's a human thing.
Tech just happens to be most on display at the moment - because tech people are building the tools and the parameters and the infrastructure handling all our interactions.
tippytippytango
an hour ago
Inductive bias matters. A lot.
ctur
4 hours ago
Architecture matters because while deep learning can conceivably fit a curve with a single, huge layer (in theory, per the universal approximation theorem), the amount of compute and data needed to get there is prohibitive. Having a good architecture means the theoretical possibility of deep learning finding the right N-dimensional curve becomes a practical reality.
Another thing about the architecture is we inherently bias it with the way we structure the data. For instance, take a dataset of (car) traffic patterns. If you only track the date as a feature, you miss that some events follow not just the day-of-year pattern but also holiday patterns. You could learn this with deep learning with enough data, but if we bake it into the dataset, you can build a model on it _much_ simpler and faster.
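A sketch of what baking it into the dataset might look like (the column names and holiday list here are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({"date": pd.date_range("2023-01-01", "2023-12-31", freq="D")})

    # Explicit calendar features instead of a raw date string:
    df["day_of_year"] = df["date"].dt.dayofyear
    df["day_of_week"] = df["date"].dt.dayofweek
    holidays = pd.to_datetime(["2023-01-01", "2023-07-04", "2023-12-25"])
    df["is_holiday"] = df["date"].isin(holidays).astype(int)

A small model on these columns can pick up weekly and holiday effects that a model shown only raw dates would need far more data to rediscover.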
So, architecture matters. Data/feature representation matters.
mr_toad
3 hours ago
> can conceivably fit a curve with a single, huge layer
I think you need a hidden layer. I’ve never seen a universal approximation theorem for a single layer network.
dheera
7 hours ago
I mean, transformer-based LLMs are RNNs, just really really really big ones with very wide inputs that maintain large amounts of context.
immibis
6 hours ago
No. An RNN has an arbitrarily-long path from old inputs to new outputs, even if in practice it can't exploit that path. Transformers have fixed-size input windows.
og_kalu
6 hours ago
You can't have a fixed state and an arbitrarily-long path from input. Well, you can, but then it's just meaningless, because you fundamentally cannot keep stuffing information of arbitrary length into a fixed state. RNNs effectively have fixed-size input windows.
immibis
6 hours ago
The path is arbitrarily long, not wide. It is possible for an RNN to be made that remembers the first word of the input, no matter how long the input is. This is not possible with a transformer, so we know they are fundamentally different.
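A toy sketch of that possibility (an idealised, hand-set gate rather than a trained model):

    import numpy as np

    # Fixed-size state that latches the first input and then ignores the rest:
    # h_t = g_t * h_{t-1} + (1 - g_t) * x_t, with g_0 = 0 and g_t = 1 afterwards,
    # roughly the kind of behaviour an LSTM-style gate can learn.
    def first_token_rnn(xs):
        h = 0.0
        for t, x in enumerate(xs):
            g = 0.0 if t == 0 else 1.0
            h = g * h + (1.0 - g) * x
        return h

    xs = np.random.default_rng(0).normal(size=100_000)
    print(first_token_rnn(xs) == xs[0])  # True, however long the sequence gets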
quotemstr
6 hours ago
But an RNN isn't going to remember the first token of input. It won't know until it sees the last token whether that first token was relevant after all, so it has to learn token-specific update rules that let it guess how long to hold what kinds of information. (In multi-layer systems, the network uses ineffable abstractions rather than tokens, but the same idea applies.)
What the RNN must be doing reminds me of "sliding window attention" --- the model learns how to partition its state between short- and long-range memories to minimize overall loss. The two approaches seem related, perhaps even equivalent up to implementation details.
OkayPhysicist
5 hours ago
The most popular RNNs (the ones that were successful enough for Google Translate and the like) actually had this behavior baked into the architecture: LSTMs, "Long Short-Term Memory" networks.
dheera
6 hours ago
A chunk of the output still goes into the transformer input, so the arbitrarily-long path still exists; it just goes through a decoding/encoding step.
_giorgio_
3 hours ago
Chollet is just a philosopher. He also thinks that Keras and TensorFlow are important, when nobody uses those. And he published false data about their usage.
fsndz
8 hours ago
After reading this paper, I am now convinced we will need more than curve fitting to build AGI: https://medium.com/@fsndzomga/there-will-be-no-agi-d9be9af44...
josh-sematic
7 hours ago
One reason why I'm excited about o1 is that it seems like OpenAI have cracked the nut of effective RL during training time, which takes us out of the domain of just fitting to the curve of "what a human would have said next." I just finished writing a couple blog posts about this; the first [1] covers some problems with that approach and the second [2] talks about what alternatives might look like.
[1] https://www.airtrain.ai/blog/how-openai-o1-changes-the-llm-t... [2] https://www.airtrain.ai/blog/how-openai-o1-changes-the-llm-t...
acchow
5 hours ago
> After reading this paper, I am now
Is this your paper?
ahzhou
7 hours ago
Author: @fsndzomga Username: fsndz
Why try to funnel us to your paywalled article?
xpl
7 hours ago
I would like to read it, but it's behind a paywall.
swolchok
8 hours ago
paper is paywalled; just logging into Medium won't do it
fsndz
6 hours ago
sorry for the paywall, you can read the free version here: https://www.lycee.ai/blog/why-no-agi-openai
vineyardmike
7 hours ago
TLDR: “statistically fitting token output is not the same as human intelligence, and human intelligence and AGI are contradictory anyways (because humans make mistakes)”
Saved you the paywall click to the poorly structured medium article :)
quantadev
6 hours ago
Most LLMs aren't even using a "curve" yet at all, right? All they're using is a series of linear equations because the model weights are a simple multiply and add (i.e. basic NN Perceptron). Sure there's a squashing function on the output to keep it in a range from 0 to 1 but that's done BECAUSE we're just adding up stuff.
I think future NNs will maybe be more adaptive than this, perhaps with some Perceptrons using sine wave functions, or other kinds of math functions, beyond just the linear "y=mx+b".
It's astounding that we DID get the emergent intelligence from just doing this "curve fitting" onto "lines" rather than actual "curves".
OkayPhysicist
6 hours ago
The "squashing function" necessarily is nonlinear in multilayer nueral networks. A single layer of a neural network can be quite simply written a weight matrix, times an input vector, equalling an output vector, like so
Ax = y
Adding another layer is just multiplying a different set of weights times the output of the first, so
B(Ax)= y
If you remember your linear algebra course, you might see the problem: that can be simplified
(BA)x = y
Cx = y
Completely indistinguishable from a single layer, thus only capable of modeling linear relationships.
To prevent this collapse, a non-linear function must be introduced between each layer.
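A quick numpy check of that collapse, and of what a nonlinearity does to it (the shapes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 3))  # first layer's weights
    B = rng.normal(size=(2, 4))  # second layer's weights
    x = rng.normal(size=3)

    C = B @ A                               # the "collapsed" single layer
    print(np.allclose(B @ (A @ x), C @ x))  # True: two linear layers act as one

    relu = lambda v: np.maximum(v, 0.0)
    print(np.allclose(B @ relu(A @ x), C @ x))  # False (unless A @ x happens to
                                                # be elementwise nonnegative)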
quantadev
5 hours ago
Right. All the squashing is doing is keeping the output of any neuron in a range of below 1.
But the entire NN itself (Perceptron ones, which most LLMs are) is still completely using nothing but linearity to store all the knowledge from the training process. All the weights are just an 'm' in the basic line equation 'y=m*x+b'. The entire training process does nothing but adjust a bunch of slopes of a bunch of lines. It's totally linear. No non-linearity at all.
nazgul17
5 hours ago
The non linearities are fundamental. Without them, any arbitrarily deep NN is equivalent to a shallow NN (easily computable, as GP was saying), and we know those can't even solve the XOR problem.
> nothing but linearity
No, if you have non linearities, the NN itself is not linear. The non linearities are not there primarily to keep the outputs in a given range, though that's important, too.
quantadev
4 hours ago
> The non linearities are not there primarily to keep the outputs in a given range
Precisely what the `Activation Function` does is to squash an output into a range (normally below one, like tanh). That's the only non-linearity I'm aware of. What other non-linearities are there?
All the training does is adjust linear weights tho, like I said. All the training is doing is adjusting the slopes of lines.
uh_uh
2 hours ago
> That's the only non-linearity I'm aware of.
"only" is doing a lot work here because that non-linearity is enough to vastly expand the landscape of functions that an NN can approximate. If the NN was linear, you could greatly simplify the computational needs of the whole thing (as was implied by another commenter above) but you'd also not get a GPT out of it.
wrs
an hour ago
With a ReLU activation function, rather than a simple linear function of the inputs, you get a piecewise linear approximation of a nonlinear function.
ReLU enables this by being nonlinear in a simple way, specifically by outputting zero for negative inputs, so each linear unit can then limit its contribution to a portion of the output curve.
(This is a lot easier to see on a whiteboard!)
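A hand-set sketch of that whiteboard picture (the weights below are chosen by hand, not trained):

    import numpy as np

    relu = lambda v: np.maximum(v, 0.0)

    # Approximate y = x^2 on [0, 3] with three ReLU units. Each unit switches on
    # at a knot and adds an increment to the slope, so the segment slopes are 1, 3, 5.
    knots = np.array([0.0, 1.0, 2.0])
    increments = np.array([1.0, 2.0, 2.0])

    def approx_square(x):
        return sum(a * relu(x - k) for a, k in zip(increments, knots))

    for x in [0.5, 1.5, 2.5]:
        print(x, approx_square(x), x ** 2)  # 0.5/0.25, 2.5/2.25, 6.5/6.25

More units mean more knots, hence a finer piecewise-linear approximation.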
jcparkyn
4 hours ago
> squash an output into a range
This isn't the primary purpose of the activation function, and in fact it's not even necessary. For example see ReLU (probably the most common activation function), leaky ReLU, or for a sillier example: https://youtu.be/Ae9EKCyI1xU?si=KgjhMrOsFEVo2yCe
quantadev
3 hours ago
You can change the subject by bringing up as many different NN architectures, Activation Functions, etc. as you want. I'm telling you the basic NN Perceptron design (what everyone means when they refer to Perceptrons in general) has something like a `tanh`, and not only is its PRIMARY function to squash a number, that's its ONLY function.
mr_toad
2 hours ago
You need a non-linear activation function for the universal approximation theorem to hold. Otherwise, as others have said the model just collapses to a single layer.
Technically the output is still what a statistician would call “linear in the parameters”, but due to the universal approximation theorem it can approximate any non-linear function.
https://stats.stackexchange.com/questions/275358/why-is-incr...
quantadev
an hour ago
As you can see in what I just posted about an inch below this, my point is that the process of training a NN does not involve adjusting any parameter of any non-linear function. What goes into an activation function is a pure sum of linear multiplications and an add, but there's no "tunable" parameter (i.e. adjusted during training) that's fed into the activation function.
beckhamc
2 hours ago
How was that person derailing the convo? Nothing says an activation function has to "squash" a number to be in some range. Leaky ReLUs for instance do `f(x) = x if x > 0 else ax` (for some coefficient `a != 0`), that doesn't squash `x` to be in any range (unless you want to be peculiar about your precise definition of what it means to squash a number). The function takes a real in `[-inf, inf]` and produces a number in `[-inf, inf]`.
> Sure there's a squashing function on the output to keep it in a range from 0 to 1 but that's done BECAUSE we're just adding up stuff.
It's not because you're "adding up stuff"; there is a specific mathematical or statistical reason why it is used. For neural networks it's there to stop your multi-layer network collapsing to a single-layer one (i.e. a linear algebra reason). You can choose whatever function you want; for hidden layers tanh generally isn't used anymore, it's usually some variant of a ReLU. In fact Leaky ReLUs are very commonly used, so OP isn't changing the subject.
If you define a "perceptron" (`g(Wx+b)` where `W` is a `Px1` matrix) and train it as a logistic regression model, then you want `g` to be sigmoid. Its purpose is to ensure that the output can be interpreted as a probability (given that you use the correct statistical loss), which means squashing the number. The converse isn't true: if I take random numbers from the internet and squash them to `[0,1]`, I don't get to call them probabilities.
> and not only is it's PRIMARY function to squash a number, that's it's ONLY function.
Squashing the number isn't the reason, it's the side effect. And even then, I just said that not all activation functions squash numbers.
> All the training does is adjust linear weights tho, like I said.
Not sure what your point is. What is a "linear weight"?
We call layers of the form `g(Wx+b)` "linear" layers, but that's an abused term: if g() is non-linear then the output is not linear. Who cares if the inner term `Wx + b` is linear? With enough of these layers you can approximate fairly complicated functions. If you're arguing whether there is a better fundamental building block, then that is another discussion.
quantadev
an hour ago
> What is a "linear weight"?
In the context of discussing linearity vs. non-linearity, adding the word "linear" in front of "weight" is clearer, which is what my top-level post on this thread was all about too.
It's astounding to me (and everyone else who's being honest) that LLMs can accomplish what they do when linear "factors" (i.e. weights) are all that's adjusted during training to achieve genuine reasoning. During training we're not [normally] adjusting any parameters or weights of any non-linear functions. I include the caveat "normally" because I'm speaking of the basic Perceptron NN using a squashing-type activation function.
mr_toad
2 hours ago
> It's astounding that we DID get the emergent intelligence from just doing this "curve fitting" onto "lines" rather than actual "curves".
In Ye Olden days (the 90's) we used to approximate non-linear models using splines or separate-slopes models, fit by hand. They were still linear, but with the right choice of splines you could approximate a non-linear model to whatever degree of accuracy you wanted.
Neural networks “just” do this automatically, and faster.
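For flavour, a sketch of that old-school recipe (toy data; the knots are hand-picked):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 200)
    y = np.sin(x) + rng.normal(0, 0.1, x.shape)  # a nonlinear target

    # Linear spline basis: linear in the parameters, yet it tracks the sine
    # as closely as the knot spacing allows.
    knots = np.arange(1, 10)
    basis = np.column_stack([np.ones_like(x), x] +
                            [np.maximum(x - k, 0.0) for k in knots])

    coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
    print("mse:", np.mean((basis @ coef - y) ** 2))  # near the 0.01 noise floor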