Apparently a new paper from DS shows this is not the case, or rather that the information isn't captured with as much fidelity as you'd expect. Intuitively, the residual stream doesn't have enough dimensions for each layer to carve out its own subspace [1].
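As a rough back-of-envelope (my own illustrative numbers, not from the paper or any particular model):

    # If every layer tried to claim a disjoint subspace of the residual stream,
    # each layer would get only d_model / n_layers dimensions to itself.
    d_model, n_layers = 4096, 64       # illustrative values only
    print(d_model // n_layers)         # -> 64 dims per layer, far narrower than the layer itself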
>And this makes it hard for layers to explore new features that are beneficial for just a few layers because you need to revert or overwrite those features as they will not be useful for later layers.
This is because, with a residual-stream architecture, removing a feature can't be done by simply zeroing out a weight; a later layer has to actively compute and add the inverse of the earlier layer's contribution.
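To make that concrete, here's a minimal numpy sketch (toy dimensions and values are mine, not from the paper) of why a residual-stream feature has to be explicitly cancelled rather than just dropped:

    import numpy as np

    d = 8                            # toy residual-stream width
    h = np.zeros(d)

    feature = np.zeros(d)
    feature[3] = 1.0                 # an early layer writes a feature it only needs locally
    h = h + feature                  # residual update: the feature now persists through every later layer

    # A later layer that no longer wants the feature can't just zero a weight to drop it;
    # it has to spend capacity computing and adding the inverse contribution:
    h = h + (-feature)               # explicit cancellation
    assert np.allclose(h, 0.0)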
>This leads each layer to contribute "generally useful" features and one immediate pattern is continuously refining features. I think this is the reason why later layers in LLMs tend to behave like that.
Greatly increasing the number of "channels" in the residual stream helps, however (although you have to play some tricks to preserve the useful "identity mapping" behavior) [2, 3].
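One way such a widening can be set up (a hypothetical sketch of the general idea, not necessarily the exact scheme in [2] or [3]): keep the stream at d_wide > d_model channels, let each block read through a down-projection and write through a zero-initialised up-projection, so every block starts out as an identity map:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_wide = 8, 32                    # toy sizes, not from the linked posts

    W_down = rng.normal(scale=d_wide ** -0.5, size=(d_wide, d_model))
    W_up = np.zeros((d_model, d_wide))         # zero init => the block writes nothing at first

    def block(h_wide, f):
        # f is an ordinary d_model -> d_model layer; the wide stream is only
        # touched through the two projections.
        return h_wide + f(h_wide @ W_down) @ W_up

    h = rng.normal(size=d_wide)
    out = block(h, f=np.tanh)                  # any toy layer works here
    assert np.allclose(out, h)                 # identity mapping preserved at init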
[1] https://x.com/rosinality/status/2006902561727721670
[2] https://x.com/norxornor/status/2006649194690257285#m
[3] https://x.com/byebyescaling/status/2007147288809087281#