mHC: Manifold-Constrained Hyper-Connections

29 points | posted 2 days ago
by ipnon

3 Comments

Alifatisk

2 days ago

So if I get this right, all transformers until today have had the same residual design: one stream carrying information between layers. DeepSeek figured out how to widen it without training collapsing. Wow, incredible work, DeepSeek!
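If I understand the idea, it looks roughly like this. A toy sketch only, not DeepSeek's code: the `read`/`write`/`mix` weights below are my own simplification of how multiple residual streams could be combined around a sublayer.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Classic single-stream residual: x <- x + f(x)."""
        def __init__(self, dim: int):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.ff = nn.Linear(dim, dim)  # stand-in for an attention/MLP sublayer

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x + self.ff(self.norm(x))

    class WidenedResidualBlock(nn.Module):
        """Widened residual: n parallel streams with learnable connection weights.
        The sublayer reads a weighted combination of the streams, and its output
        is written back into each stream (a hypothetical simplification of
        hyper-connections; the real method has more structure)."""
        def __init__(self, dim: int, n_streams: int = 4):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.ff = nn.Linear(dim, dim)
            self.read = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))  # stream -> sublayer input
            self.write = nn.Parameter(torch.ones(n_streams))                     # sublayer output -> streams
            self.mix = nn.Parameter(torch.eye(n_streams))                        # stream-to-stream mixing

        def forward(self, streams: torch.Tensor) -> torch.Tensor:
            # streams: (n_streams, batch, seq, dim)
            h = (self.read.view(-1, 1, 1, 1) * streams).sum(dim=0)    # read: combine streams
            out = self.ff(self.norm(h))                                # apply the sublayer once
            mixed = torch.einsum('ij,jbtd->ibtd', self.mix, streams)  # mix streams with each other
            return mixed + self.write.view(-1, 1, 1, 1) * out          # write the output back

    streams = torch.randn(4, 2, 16, 64)  # 4 streams, batch 2, seq 16, dim 64
    block = WidenedResidualBlock(dim=64, n_streams=4)
    print(block(streams).shape)          # torch.Size([4, 2, 16, 64])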

rvz

a day ago

Yes. This is the first general improvement to the residual design in deep neural networks in a long time, and it also improves on training LLMs with hyper-connections (HC) at large scale compared with the standard HC architecture.

So far they have tested this by training 27B models with only a tiny overhead, and it shows fewer "exploding" signals than the other approaches and the baseline. It would be interesting to see results from >100B-parameter models.
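A toy way to see the "exploding" point (my own illustration; the row-normalisation here is just a stand-in for whatever manifold constraint mHC actually applies):

    import torch

    def final_norm(depth: int = 64, n_streams: int = 4, dim: int = 64,
                   constrain: bool = False, seed: int = 0) -> float:
        torch.manual_seed(seed)
        streams = torch.randn(n_streams, dim)
        for _ in range(depth):
            # random layer-wise stream-mixing matrix, close to identity
            mix = torch.eye(n_streams) + 0.2 * torch.randn(n_streams, n_streams)
            if constrain:
                mix = mix.abs()
                mix = mix / mix.sum(dim=1, keepdim=True)  # rows sum to 1 -> spectral radius 1
            streams = mix @ streams
        return streams.norm().item()

    print("unconstrained:", final_norm(constrain=False))  # norm tends to grow with depth
    print("constrained:  ", final_norm(constrain=True))   # stays on the order of the input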

This should be recommended reading for those interested in micro-design changes from the days of residual networks (ResNet) to Manifold-Constrained Hyper-Connections (mHC).

Instead of just throwing more GPUs + money + parameters + data at the problem.