janalsncm
5 days ago
One of the benefits of using thinking tokens compared to “thinking in latent space” is that you can directly observe the quality of the CoT. In R1 they saw it was mixing languages and fixed it with cold start data.
It would be hard to SFT this because you can only SFT on the final result, not the latent space.
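Roughly, in code (PyTorch-ish sketch; the HF-style .logits call and all of the latent-side method names like recurrent_block and decode are made up for illustration):

    import torch
    import torch.nn.functional as F

    # Explicit CoT: the "thinking" is just more tokens in the target sequence,
    # so ordinary next-token cross-entropy supervises it directly.
    def sft_loss_with_cot(model, prompt_ids, cot_ids, answer_ids):
        input_ids = torch.cat([prompt_ids, cot_ids, answer_ids], dim=-1)
        logits = model(input_ids).logits                  # (batch, seq, vocab)
        return F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            input_ids[:, 1:].reshape(-1),
        )

    # Latent recurrence: the loop iterates on hidden states that never correspond
    # to tokens, so there is nothing in the middle to attach a loss to; only the
    # final answer gets supervised.
    def sft_loss_latent(model, prompt_ids, answer_ids, n_latent_steps=8):
        h = model.embed(prompt_ids)
        for _ in range(n_latent_steps):
            h = model.recurrent_block(h)                  # no labels exist for these states
        logits = model.unembed(model.decode(h, model.embed(answer_ids[:, :-1])))
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            answer_ids[:, 1:].reshape(-1),
        )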
I also notice the authors only had compute for a single full training run. It’s impressive they saw such good results from that, but I wonder if they could get better results by incorporating recent efficiency improvements.
I would personally not use this architecture because 1) it adds a lot of hyperparameters which don’t have a strong theoretical grounding and 2) it’s not clearly better than simpler methods.
edouard-harris
4 days ago
> In R1 they saw it was mixing languages and fixed it with cold start data.
They did (partly) fix R1's tendency to mix languages, thereby making its CoT more interpretable. But that fix came at the cost of degrading the quality of the final answer.[0] Since we can't reliably do interpretability on latents anyway, presumably the only metric that matters in that case is answer quality - and so observing thinking tokens gets you no marginal capability benefit. (It does however give you a potential safety benefit - as Anthropic vividly illustrated in their "alignment faking" paper. [1])
The bitter lesson strikes yet again: if you ask for X to get to Y, your results are worse than if you'd just asked for Y directly in the first place.
[0] From the R1 paper: "To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in *a slight degradation in the model’s performance*, this reward aligns with human preferences, making it more readable." [emphasis added]
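For concreteness, that reward is trivial to sketch (a toy version of my own, not DeepSeek's code; detect_lang stands in for whatever language classifier they actually used):

    def detect_lang(word):
        # Stand-in: real code would call a language-ID model; here, a crude ASCII check.
        return "en" if word.isascii() else "other"

    def language_consistency_reward(cot_text, target_lang="en"):
        """Toy version: fraction of CoT words detected as the target language."""
        words = cot_text.split()
        if not words:
            return 0.0
        hits = sum(1 for w in words if detect_lang(w) == target_lang)
        return hits / len(words)

    # Folded into the RL objective as an extra term, roughly:
    #   total_reward = accuracy_reward + lambda_lang * language_consistency_reward(cot, "en")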
janalsncm
4 days ago
Interpretability also matters when you’re training. If the model works, yes, technically only the final result matters. But in practice it probably won’t work right away and so it’s great to have methods to figure out what is going wrong as you’re training.
For example, should we stop this training or keep going and wait for it to improve? In theory that’s irrelevant because we don’t make mistakes. In practice, theory is just theory.
As an analogy, you technically don’t need code comments. The compiler removes them. But in practice you do need them.
So that’s another reason I mentioned the hyperparameter hell. You’ve removed a simple interpretability method and left us with numbers that worked for a single training run.
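Concretely, the kind of check you lose is as cheap as this (assuming an HF-style model and tokenizer):

    # Every N steps, decode a couple of prompts and just read the CoT. Language
    # mixing, degenerate repetition, etc. show up immediately. There is no
    # equivalent eyeball check for a stack of latent states.
    def log_cot_samples(model, tokenizer, prompts, step, every=500):
        if step % every != 0:
            return
        for p in prompts[:2]:
            ids = tokenizer(p, return_tensors="pt").input_ids
            out = model.generate(ids, max_new_tokens=256)
            print(f"[step {step}] {tokenizer.decode(out[0], skip_special_tokens=True)}")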
pilooch
4 days ago
It could be argued that "thinking" / CoT in latent space abstracts away the language issue, and that in fact language in reasoning steps doesn't matter. Latent tokens could actually be decoded afterwards to any target language. Much more powerful IMO.
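For instance, you could imagine a logit-lens-style readout after the fact (my own sketch, not something from the paper): project each latent state through the unembedding and keep only the token ids of whichever language you want the "translation" in.

    import torch

    def decode_latents(latent_states, unembed_weight, lang_token_ids):
        # latent_states: (steps, d_model); unembed_weight: (vocab, d_model);
        # lang_token_ids: LongTensor of token ids belonging to the target language.
        logits = latent_states @ unembed_weight.T         # (steps, vocab)
        restricted = logits[:, lang_token_ids]            # only that language's tokens
        return lang_token_ids[restricted.argmax(dim=-1)]  # nearest in-language token per step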
On a side note, there's decent research on whether bilingual humans actually think in both languages, and on how they tend to be better at decisive thinking outside of their mother tongue.
janalsncm
4 days ago
I think another argument is that explicit CoT is simply unrolling the recurrent loop that this method uses, with an unembedding -> embedding -> unembedding round trip added during the decoding process.
So at best, the recurrent loop only saves you the embedding -> unembedding at each token, which is relatively cheap compared with the depth of the decoder blocks.
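Spelled out (made-up method names, just to show the shape of the argument), the two loops only differ in whether the hidden state takes that round trip through the vocabulary:

    import torch

    # Explicit CoT: each step unembeds to a token, re-embeds it, and runs the
    # decoder stack again.
    def cot_step(model, h):
        token = model.unembed(h[:, -1]).argmax(dim=-1)              # hidden -> vocab
        h = torch.cat([h, model.embed(token)[:, None, :]], dim=1)   # vocab -> hidden
        return model.decoder_blocks(h)

    # Latent recurrence: the hidden state is fed straight back in.
    def latent_step(model, h):
        return model.recurrent_block(h)

Either way you pay for the full decoder stack on every step; the recurrence only skips the two vocab-sized matmuls.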
nielsole
4 days ago
With a bit of fiddling you should be able to get the LLM to translate/summarize the thinking process. Not a 1:1 mapping, but still.
WithinReason
4 days ago
How would you do it?
nielsole
4 days ago
My naive way would be to try seq2seq with the hidden states as input. Not sure where the supervised samples would come from, though.
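Something like this is what I have in mind (toy sketch; the architecture is the easy part, the missing (hidden_states, summary) training pairs are the hard part):

    import torch.nn as nn

    class LatentTranslator(nn.Module):
        """Toy latent -> text translator: encode the sequence of recurrent
        hidden states, decode a natural-language summary of them."""
        def __init__(self, d_latent, d_model=512, vocab_size=32000):
            super().__init__()
            self.proj = nn.Linear(d_latent, d_model)
            self.seq2seq = nn.Transformer(d_model=d_model, batch_first=True)
            self.embed = nn.Embedding(vocab_size, d_model)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, latent_states, summary_ids):
            src = self.proj(latent_states)      # (batch, steps, d_model)
            tgt = self.embed(summary_ids)       # (batch, len, d_model) -- causal mask omitted
            return self.out(self.seq2seq(src, tgt))   # logits over summary tokens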
WithinReason
4 days ago
OK but what would you use as ground truth?