Is One Layer Enough? A Single Transformer Layer Matches Full-Parameter RL Train

42 points, posted 3 hours ago
by tcp_handshaker

12 Comments

HarHarVeryFunny

2 minutes ago

It's interesting that it's the middle layers of the Transformer that are affected most by RL post-training, but it perhaps makes some intuitive sense given that RL is being used to shape the high-level, planning-type direction of the output.

It seems that the input layers of a Transformer are necessarily going to be doing the most low-level work of syntax -> semantic augmentation, starting with things like tagging parts of speech. Similarly, the output layers are by necessity going to be concerned with mapping high-level representations back into surface-level word sequences. This leaves the middle layers to do the work of first recognizing patterns deep enough to support good-quality prediction, and then doing the high-level prediction itself, which is what RL is typically going to be trying to shape.

soleveloper

4 minutes ago

Makes sense. This is very similar to fine-tuning on a downstream task in an encoder-decoder architecture (~BERT style).

mike_hearn

24 minutes ago

This result feels very intuitive. The early layers of a transformer can be thought of as understanding surface level things like syntax, how tokens group, which groups are entities and how to disambiguate them, etc. The last layers are in a sense decoding ideas into a selection of words, ensuring the grammar makes sense, that the text flows and is structured correctly, etc. The middle layers are where the abstract thought and manipulation of concepts is happening.

But for the tasks this paper uses for RL training, it's all about improving the way the net is manipulating concepts. So the middle layers are where the focus should be.

Note: RL is also used for tasks that aren't about conceptual manipulation, like instruct training. I bet that their result doesn't hold for that because the delta vs the foundation model is all about the selection of words and flow of the text, not the core understanding.

usernametaken29

2 hours ago

If you think about it for some time then you’ll come to realise transformers are autoencoders on steroids. A small input space is expanded onto a big manifold and contracted again. Now, suppose you want to impose a function to regulate the output of an autoencoder. It’s actually pretty obvious that you need exactly one layer to do so… f(manifold).

getnormality

an hour ago

What you're suggesting seems to go implausibly far beyond what the paper says.

RL post-training alters the parameters of the transformer, while your f(manifold) idea seems to suggest that a new layer on top would suffice, no need to alter the transformer itself at all.

It would be extremely handy if that were so, but I'm guessing it isn't, or it would be the prevailing approach.

soraki_soladead

2 hours ago

I might be misunderstanding your point, but this conflates the distinguishing features of each. You mention expansion, but autoencoders canonically compress their inputs. Autoencoders have an explicit encoder and decoder. Most transformers we interact with these days (LLMs) are decoder-only. The manifold isn't typically something the model is applied to directly. We apply the function/model to the latent representations. Those are what live on the manifold.

usernametaken29

an hour ago

Now that’s interesting. What exactly distinguishes latent representations from the manifold? IMHO, those are the same, and you’re constructing a piecewise function of the manifold itself. Decoders also produce manifolds in much the same way, with the distinction being that the encoder isn’t learned but static after initialisation. So fundamentally it is still DOING the same operation.

soraki_soladead

43 minutes ago

The latent representations of the data are like points on a surface. That surface is the manifold. We don't typically have the full manifold and can only sample points from it by embedding data into it.

Worth noting a different manifold "exists" after each transformation (e.g. layer). You only sample from the same manifold when you apply the same transformation(s).

CuriouslyC

4 minutes ago

Also worth noting that in reality manifolds will be "spiky" in very high dimensions, so the idea of a "surface" is best understood through patterns of distance between samples in embedding space and the way they collapse in low dimensions.

earthnail

2 hours ago

Took me a short time to understand what you mean by "autoencoders on steroids", but I believe you mean they are autoencoders with an inverse bottleneck: an intermediate representation that isn't smaller than the input space, but much larger. Is my understanding of your comment correct?

usernametaken29

an hour ago

Kind of. Autoencoders don’t need to have an embedding that’s smaller than the input. Their only requirement is that they compress information and thus create reconstruction loss. Typically, however, they are not trained this way because they don’t converge. Transformers do the same thing, but they can squeeze many more bits of information through one pass because of the way they are designed. This holds true even for decoder-only networks, because they’re still doing the same thing.

tribal808

18 minutes ago

If most of the performance gains are concentrated in a few middle layers, you can save a massive amount of compute by freezing the rest.
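
To put rough numbers on the freezing idea, here is a minimal back-of-the-envelope sketch. Everything in it is an assumption made up for illustration, not taken from the paper: the model shape (`NUM_LAYERS`, `PARAMS_PER_LAYER`, `EMBEDDING_PARAMS`), the helper names, and the choice of an Adam-style optimizer that keeps two fp32 moment tensors per trainable parameter.

```python
# Hypothetical model shape, chosen only for illustration (not from the paper).
NUM_LAYERS = 32
PARAMS_PER_LAYER = 200_000_000
EMBEDDING_PARAMS = 500_000_000


def trainable_params(trainable_layers):
    """Number of parameters that receive gradient updates."""
    # Deduplicate layer indices so a repeated index is not counted twice.
    return len(set(trainable_layers)) * PARAMS_PER_LAYER


def trainable_fraction(trainable_layers):
    """Fraction of all model parameters that are not frozen."""
    total = NUM_LAYERS * PARAMS_PER_LAYER + EMBEDDING_PARAMS
    return trainable_params(trainable_layers) / total


def adam_state_bytes(trainable_layers, bytes_per_value=4):
    """Optimizer state size, assuming two moment tensors per trainable parameter."""
    return 2 * bytes_per_value * trainable_params(trainable_layers)


middle_only = [NUM_LAYERS // 2]       # train a single middle layer
all_layers = list(range(NUM_LAYERS))  # train every transformer layer

for name, layers in [("middle only", middle_only), ("all layers", all_layers)]:
    frac = trainable_fraction(layers)
    state_gb = adam_state_bytes(layers) / 1e9
    print(f"{name}: {frac:.2%} trainable, {state_gb:.1f} GB optimizer state")
```

One caveat on the "massive" part: the parameter ratio overstates the compute saving. The forward pass still runs through every layer, and the backward pass still has to propagate activation gradients through the layers above the trained one. The clearer wins are in optimizer state and gradient memory, plus skipping the backward pass for the layers below it.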