amunozo
2 days ago
Are these models trained from scratch, or do they need distillation from bigger models to be competitive? Usually they're the small model in a family that also has a bigger one. If they are trained from scratch, does anybody know what the economics are of training this 30B-A3B model vs. training a model the size of DeepSeek V4 Pro or Flash (1.6T and 200-something B total parameters, with fewer activated)?
namr2000
2 days ago
You don't have to train from scratch but you can. Distillation ends up being somewhere in the ballpark of 1000x faster to train [1]. It also comes with the huge advantage of not needing to create RLHF datasets, since you can just copy the behavior of the teacher model. This saves an enormous amount of labeling money at the cost of making the model behave similarly to the teacher. If you are training from scratch, you can look at LLM scaling laws to figure out roughly the compute budget you need to optimally train a model [2].
Based on [2], a 30B model needs something like 2e+23 FLOPs to train from scratch, whereas a 1.6T model needs something like 1e+27 FLOPs. So DeepSeek V4 Pro was roughly 5000x more expensive to train than this model. I'm not totally sure how MoE affects scaling laws, so these numbers might be different in reality, but it gives you a good ballpark estimate of the difference in training scale.
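For anyone who wants to redo this kind of estimate, here is a minimal sketch of the arithmetic. It does not use the fitted curves from [2]; it assumes the common rule of thumb that training a dense model costs about 6 FLOPs per parameter per token, with roughly 20 tokens per parameter at the compute-optimal point. Both constants and the function name `train_flops` are assumptions for illustration.

```python
# Rule-of-thumb estimate, not the exact fits from [2]:
#   training FLOPs C ~= 6 * N * D, with compute-optimal D ~= 20 * N tokens.
TOKENS_PER_PARAM = 20        # assumed compute-optimal tokens per parameter
FLOPS_PER_PARAM_TOKEN = 6    # assumed forward + backward cost, dense model


def train_flops(n_params: float) -> float:
    """Approximate compute-optimal training FLOPs for a dense model."""
    n_tokens = TOKENS_PER_PARAM * n_params
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


small = train_flops(30e9)     # 30B total parameters
large = train_flops(1.6e12)   # 1.6T total parameters
print(f"30B:   {small:.1e} FLOPs")
print(f"1.6T:  {large:.1e} FLOPs")
print(f"ratio: {large / small:.0f}x")
```

Under these assumptions the result is about 1e+23 and 3e+26 FLOPs, a ratio near 2800x, so the same ballpark as the figures above. Because cost grows with the square of the parameter count here, the ratio is sensitive to which sizes you plug in. For an MoE model the per-token cost tracks the activated parameters more closely than the total, so treating the total count as dense probably overstates the real compute.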
[1] https://arxiv.org/abs/2505.12781 [2] https://arxiv.org/abs/2203.15556
amunozo
a day ago
Thank you for taking the time, this is a very useful and complete answer.