mHC: Manifold-Constrained Hyper-Connections

29 points | posted 2 days ago
by ipnon

3 Comments

Alifatisk

2 days ago

So if I get this right, all transformers until today have had the same residual design: one stream carrying information between layers. DeepSeek figured out how to widen it without training collapsing. Wow, incredible work, DeepSeek!
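If I understand the idea, it looks roughly like this. A toy sketch only, not DeepSeek's code: the `read`/`write`/`mix` weights below are my own simplification of how multiple residual streams could be combined around a sublayer.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Classic single-stream residual: x <- x + f(x)."""
        def __init__(self, dim: int):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.ff = nn.Linear(dim, dim)  # stand-in for an attention/MLP sublayer

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x + self.ff(self.norm(x))

    class WidenedResidualBlock(nn.Module):
        """Widened residual: n parallel streams with learnable connection weights.
        The sublayer reads a weighted combination of the streams, and its output
        is written back into each stream (a hypothetical simplification of
        hyper-connections; the real method has more structure)."""
        def __init__(self, dim: int, n_streams: int = 4):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.ff = nn.Linear(dim, dim)
            self.read = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))  # stream -> sublayer input
            self.write = nn.Parameter(torch.ones(n_streams))                     # sublayer output -> streams
            self.mix = nn.Parameter(torch.eye(n_streams))                        # stream-to-stream mixing

        def forward(self, streams: torch.Tensor) -> torch.Tensor:
            # streams: (n_streams, batch, seq, dim)
            h = (self.read.view(-1, 1, 1, 1) * streams).sum(dim=0)    # read: combine streams
            out = self.ff(self.norm(h))                                # apply the sublayer once
            mixed = torch.einsum('ij,jbtd->ibtd', self.mix, streams)  # mix streams with each other
            return mixed + self.write.view(-1, 1, 1, 1) * out          # write the output back

    streams = torch.randn(4, 2, 16, 64)  # 4 streams, batch 2, seq 16, dim 64
    block = WidenedResidualBlock(dim=64, n_streams=4)
    print(block(streams).shape)          # torch.Size([4, 2, 16, 64])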

rvz

a day ago

Yes. This is the first general improvement to the residual design in deep neural networks in a long time, and it also improves on training LLMs with hyper-connections (HC) at large scale compared with the standard HC architecture.

So far they have tested this by training 27B models with only a tiny overhead, and it shows fewer "exploding" signals than the other approaches and the baseline. It would be interesting to see results from >100B-parameter models.
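A toy way to see the "exploding" point (my own illustration; the row-normalisation here is just a stand-in for whatever manifold constraint mHC actually applies):

    import torch

    def final_norm(depth: int = 64, n_streams: int = 4, dim: int = 64,
                   constrain: bool = False, seed: int = 0) -> float:
        torch.manual_seed(seed)
        streams = torch.randn(n_streams, dim)
        for _ in range(depth):
            # random layer-wise stream-mixing matrix, close to identity
            mix = torch.eye(n_streams) + 0.2 * torch.randn(n_streams, n_streams)
            if constrain:
                mix = mix.abs()
                mix = mix / mix.sum(dim=1, keepdim=True)  # rows sum to 1 -> spectral radius 1
            streams = mix @ streams
        return streams.norm().item()

    print("unconstrained:", final_norm(constrain=False))  # norm tends to grow with depth
    print("constrained:  ", final_norm(constrain=True))   # stays on the order of the input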

This should be recommended reading for those interested in micro-design changes from the days of residual networks (ResNet) to Manifold-Constrained Hyper-Connections (mHC).

Instead of just throwing more GPUs + money + parameters + data at the problem.