Scaling up test-time compute with latent reasoning: A recurrent depth approach

147 points | posted 5 days ago
by timbilt

43 Comments

janalsncm

5 days ago

One of the benefits of using thinking tokens compared to “thinking in a latent space” is that you can directly observe the quality of the CoT. In R1 they saw it was mixing languages and fixed it with cold-start data.

It would be hard to SFT this because you can only SFT the final result, not the latent space.

I also notice the authors only had compute for a single full training run. It’s impressive they saw such good results from that, but I wonder if they could get better results by incorporating recent efficiency improvements.

I would personally not use this architecture because 1) it adds a lot of hyperparameters which don’t have a strong theoretical grounding and 2) it’s not clearly better than simpler methods.

edouard-harris

4 days ago

> In R1 they saw it was mixing languages and fixed it with cold start data.

They did (partly) fix R1's tendency to mix languages, thereby making its CoT more interpretable. But that fix came at the cost of degrading the quality of the final answer.[0] Since we can't reliably do interpretability on latents anyway, presumably the only metric that matters in that case is answer quality - and so observing thinking tokens gets you no marginal capability benefit. (It does however give you a potential safety benefit - as Anthropic vividly illustrated in their "alignment faking" paper. [1])

The bitter lesson strikes yet again: if you ask for X to get to Y, your results are worse than if you'd just asked for Y directly in the first place.

[0] From the R1 paper: "To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model’s performance, this reward aligns with human preferences, making it more readable." [emphasis added]

[1] https://arxiv.org/pdf/2412.14093
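For concreteness, the reward described in [0] is simple to picture in code. A minimal sketch, assuming a regex word split and a toy language check (DeepSeek's actual tokenization and language detection aren't public):

```python
# Sketch of a language-consistency reward as described in the quote above:
# reward = fraction of CoT words that are in the target language.
# The word splitting and the language check are illustrative assumptions.
import re

def language_consistency_reward(cot: str, is_target_language) -> float:
    words = re.findall(r"\w+", cot)
    if not words:
        return 0.0
    return sum(1 for w in words if is_target_language(w)) / len(words)

# Toy check: treat pure-ASCII alphabetic words as "target language" (English).
is_english = lambda w: w.isascii() and w.isalpha()

print(language_consistency_reward("Let me think 那么 the answer is 4", is_english))  # 0.75
```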

janalsncm

4 days ago

Interpretability also matters when you’re training. If the model works, yes, technically only the final result matters. But in practice it probably won’t work right away and so it’s great to have methods to figure out what is going wrong as you’re training.

For example, should we stop this training or keep going and wait for it to improve? In theory that’s irrelevant because we don’t make mistakes. In practice, theory is just theory.

As an analogy, you technically don’t need code comments. The compiler removes them. But in practice you do need them.

So that’s another reason I mentioned the hyperparameter hell. You’ve removed a simple interpretability method and left us with numbers that worked for a single training run.

pilooch

4 days ago

It could be argued that "thinking" / CoT in latent space abstracts away the language issue, and that in fact language in reasoning steps doesn't matter. Latent tokens could actually be decoded afterwards to any target language. Much more powerful IMO.

On a side note, there's decent research on how bilingual humans do actually think in both languages, and are actually better at decisive thinking outside of their mother tongue.

janalsncm

4 days ago

I think another argument is that explicit CoT is simply unrolling the recurrent loop that this method uses, with an extra unembedding -> embedding round trip at each step of the decoding process.

So at best, using a recurrent loop only saves you the unembed -> embed round trip at each token, which is relatively cheap compared with the depth of the decoder stack.
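A minimal sketch of the contrast I mean (toy modules, not the paper's actual architecture):

```python
import torch
import torch.nn as nn

# Toy stand-ins; dimensions and modules are illustrative assumptions.
vocab, d = 100, 32
embed = nn.Embedding(vocab, d)                      # token id -> latent
unembed = nn.Linear(d, vocab)                       # latent -> logits
core = nn.Sequential(nn.Linear(d, d), nn.GELU())    # stand-in for the decoder stack

def cot_step(h):
    """Explicit CoT: collapse the latent to a discrete token, then re-embed it."""
    h = core(h)
    token = unembed(h).argmax(-1)                   # the unembed -> embed round trip
    return embed(token)

def latent_step(h):
    """Recurrent depth: keep iterating on the latent directly, no token round trip."""
    return core(h)

h = embed(torch.tensor([0]))
for _ in range(4):
    h = latent_step(h)          # swap in cot_step(h) for the unrolled-through-tokens version
```

The unembed/embed round trip is the only structural difference here; the bulk of the compute is the shared core either way.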

nielsole

4 days ago

With a bit of fiddling you should be able to get the LLM to translate/summarize the thinking process. Not a 1:1 thing, but still

WithinReason

4 days ago

how would you do it?

nielsole

4 days ago

my naive way would be to try seq2seq with the hidden states as input. Not sure where the supervised samples would come from, though.
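Roughly, a small decoder that cross-attends to the recorded latent trajectory (a sketch of the idea only; the sizes are made up, and the missing piece is exactly the (latents, text) training pairs):

```python
import torch
import torch.nn as nn

# Sketch of a "latent translator": a seq2seq probe that reads the recurrent hidden
# states as memory and emits an explanation in text. Sizes and modules are assumptions.
d, vocab = 512, 32000
layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
translator = nn.TransformerDecoder(layer, num_layers=2)
tok_embed = nn.Embedding(vocab, d)
to_logits = nn.Linear(d, vocab)

def explain(latent_trajectory, prev_tokens):
    # latent_trajectory: (batch, r_steps, d) hidden states saved during the recurrence
    # prev_tokens:       (batch, t) explanation tokens generated so far
    tgt = tok_embed(prev_tokens)
    out = translator(tgt=tgt, memory=latent_trajectory)
    return to_logits(out)       # next-token logits for the explanation
```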

WithinReason

4 days ago

OK but what would you use as ground truth?

WhitneyLand

4 days ago

One of the hoped-for benefits of this approach is described later in the paper. It’s not fully fleshed out what this will mean, but the prospect is tantalizing.

"On a more philosophical note, we hope that latent reasoning captures facets of human reasoning that defy verbalization, such as spatial thinking, physical intuition or (motor) planning. Over many iterations of the recurrent process, reasoning in a high-dimensional vector space would enable the deep exploration of multiple directions simultaneously, instead of linear thinking, leading to a system capable of exhibiting novel and complex reasoning behavior."

ckrapu

4 days ago

My opinion is that opaque reasoning is a prerequisite for many of the worst possible AI outcomes.

We should make reasoning fully visible in the output space.

optimalsolver

4 days ago

Is there any actual evidence that the reasoning tokens output by current models actually represent the computation happening in the hidden layers?

In both cases, the model is doing a ton of processing that you can't actually inspect, except here, you at least get some efficiency gains.

Even more importantly, you're also less likely to convince yourself that you know what the model is thinking.

ckrapu

2 days ago

In the autoregressive decoding framework, the hidden-layer state used to compute token `t` is conditionally independent of the hidden states for `t-1`, `t-2`, and so on, given the observed tokens.

Put differently, the observed tokens are a bottleneck on the information that can be communicated across tokens. Any scheming performed by an LLM which requires more than one token to formulate must therefore pass through the visible tokens. With opaque vectors transferred across decoding steps, this is not the case.

The computation in the hidden layers, as far as we can tell, is not sufficient for scheming in a single decoding step. It looks like it requires O(10^2) or O(10^3) steps instead, judging from anecdotal evidence like the reports of scheming from o1 (https://cdn.openai.com/o1-system-card-20241205.pdf)

As far as your last point goes, I'd rather have a more transparent system, all other factors held constant.

anothermathbozo

4 days ago

No and we’ve observed evidence to the contrary

mola

4 days ago

Do you have some reading material on this? How did they identify the difference between the stated CoT and the "actual processing"?

miven

4 days ago

Chain of thought isn't exactly transparent either; you shouldn't fall into the pitfall of believing that the emitted sequence of tokens thinking about the task is the only processing the model actually performs during CoT.

There might be a lot of other hidden computation happening within the model's latents which may not immediately influence the predicted tokens but may still be relevant to the model's internal processing. And even disregarding that, the model is under no formal obligation to stick to the chain of thought it produced when making its final decisions.

nsikorr

4 days ago

The paper suggests that that is still possible with the proposed architecture if needed.

DennisP

4 days ago

That actually sounds like it'd be really helpful.

Imanari

4 days ago

maybe let it reason in latent space, but have a method to transform the latents into text and output that for inspection.

nialv7

4 days ago

Slightly off topic, but I rarely see papers talk about their failed training runs and why those runs failed. This paper is definitely a breath of fresh air. Their analyses of the failures, the changes they made to fix them, and the rationale behind those changes are all very insightful.

tkellogg

4 days ago

The R1 paper did it as well. Agreed, it's always very interesting.

HarHarVeryFunny

5 days ago

Latent / embedding-space reasoning seems a step in the right direction, but building recurrence into the model while still relying on gradient descent (i.e. BPTT) to train it seems to create more of a problem (training inefficiency) than it solves, especially since they still end up externally specifying the number of recurrent iterations (r=4, 8, etc) for a given inference. Ideally having recurrence internal to the model would allow the model itself to decide how long to iterate for before outputting anything.
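For reference, the training-time recurrence is roughly this shape as I read the paper (module contents, the choice of r per batch, and the truncation length below are placeholders, not the exact setup):

```python
import torch
import torch.nn as nn

# Recurrent-depth forward pass: a shared core iterated r times on a latent state,
# with the embedded input injected at every step and gradients only flowing through
# the last k iterations (truncated backprop), so training cost stays bounded.
d = 64
prelude = nn.Linear(d, d)                               # stand-in for the input blocks
core = nn.Sequential(nn.Linear(2 * d, d), nn.GELU())    # recurrent block on [state, input]
coda = nn.Linear(d, d)                                  # stand-in for the output blocks

def forward(x, r, k_backprop=4):
    e = prelude(x)
    s = torch.randn_like(e)                 # random initial latent state
    for i in range(r):
        if i == r - k_backprop:
            s = s.detach()                  # truncate the gradient path here
        s = core(torch.cat([s, e], dim=-1))
    return coda(s)

out = forward(torch.randn(2, d), r=8)       # r is specified externally, as noted above
```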

Manabu-eo

4 days ago

While not the main focus, see Section 6.1 and Figure 10 for a simple adaptive exit strategy for inference.

I imagine that they chose a fixed number of recurrent iterations during training for parallelization purposes. Not depending on the previous step to train the next is the main advance of transformers over LSTMs (plus the higher internal bandwidth). But I agree that it might not be the most efficient model to train, due to all that redundant work at large r.
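The exit rule there is essentially "stop iterating once the output distribution stops changing." A minimal sketch of that idea (the threshold, cap, and toy modules are my assumptions, not the paper's numbers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Adaptive exit: iterate the recurrent core until the next-token distribution
# between consecutive iterations stops moving (small KL), then stop.
d, vocab = 64, 1000
core = nn.Sequential(nn.Linear(2 * d, d), nn.GELU())    # recurrent block on [state, input]
unembed = nn.Linear(d, vocab)

def iterate_until_converged(e, max_iters=32, tol=5e-4):
    s = torch.randn_like(e)                              # random initial latent state
    prev_logp = None
    for i in range(max_iters):
        s = core(torch.cat([s, e], dim=-1))
        logp = F.log_softmax(unembed(s), dim=-1)
        if prev_logp is not None:
            kl = F.kl_div(logp, prev_logp, log_target=True, reduction="batchmean")
            if kl < tol:                                 # distribution has settled; exit early
                break
        prev_logp = logp
    return s, i + 1                                      # latent state and iterations used

state, steps = iterate_until_converged(torch.randn(1, d))
```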

thomasahle

5 days ago

> Latent / embedding-space reasoning seems a step in the right direction

Might be good for reasoning, but it's terrible for interpretation / AI-safety.

Tostino

4 days ago

Why is doing 4 recurrent passes any different from having a model that is 4x deeper?

lonk11

4 days ago

Running one layer 4 times only needs to fetch that layer's weights once. Running 4 distinct layers makes you fetch 4x the parameters.

The recurrent approach is more efficient when memory bandwidth is the bottleneck. They talk about it in the paper.
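Back-of-the-envelope version of that point (all sizes made up for illustration):

```python
# Weight-memory comparison: one shared 25B block iterated 4 times vs 4 distinct blocks
# at the same effective depth. Numbers are illustrative, not from the paper.
bytes_per_param = 2                  # bf16
block_params = 25e9                  # one recurrent block
r = 4                                # iterations -> effective depth of a ~100B model

recurrent_weights = block_params * bytes_per_param       # ~50 GB to store / keep hot
deep_weights = r * block_params * bytes_per_param        # ~200 GB for 4 distinct blocks

# In small-batch decoding where weight reads dominate, the recurrent model streams
# roughly 1/4 of the weight bytes per token for the same effective depth;
# the FLOPs are the same either way.
print(recurrent_weights / 1e9, "GB vs", deep_weights / 1e9, "GB")
```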

Tostino

3 days ago

Yeah, understood. I'm excited for the reduction in parameter count that will come when this is taken up in major models.

I meant it rhetorically in reference to interpretability. I don't see a real difference between training a model that is 100b parameters vs a (fixed) 4x recurrent 25b parameter model as far as understanding what the model is `thinking` for the next token prediction task.

You should be able to use the same interpretability tooling for either. It can only `scheme` so much before it outputs the next token, whether the model is just a fixed size and quite deep, or recurrent.

thomasahle

4 days ago

I guess the most interpretable setup is to have as shallow a model as possible, but with a longer CoT. It would be quite interesting to see the trade-off between the two. Though, unfortunately, deeper is probably better.

janalsncm

5 days ago

> seems a step in the right direction

I can’t see why. I can’t think of any problems where recurrent loops with latent streams would be preferable to tokens. And the downsides are obvious.

> externally specifying the number of recurrent iterations

Yeah this seems wrong to me. At least with RL training you saw that the length of the CoT decreased dramatically before climbing again, as the model became more proficient.

HarHarVeryFunny

4 days ago

> I can’t see why

It just provides a bigger representation space, and seems more like what we do given that many people don't have an inner dialog, and some think pictorially.

It seems it could allow reasoning over superpositions of concepts, if such things exist inside the model (but presumably not at the edges, where they need to be decodable into specific tokens).

viraptor

4 days ago

> I can’t think of any problems where recurrent loops with latent streams would be preferable to tokens.

Efficiency. Written language is extremely inefficient. By running through whole concepts at a time instead of parts of a word, the reasoning can be much more concise.

jonathanrmumm

4 days ago

If we're talking about conscious thought, it takes millions of simultaneously firing neurons to form words. If we're talking about unconscious intelligence, it's closer to latent space. A lot of intelligence can't be articulated.

ckrapu

4 days ago

Identifying scheming in the latent streams would be harder as you would have an extra layer of obfuscation between you and the model’s reasoning.

tmnvdb

5 days ago

Interesting stuff. As the authors note, latent reasoning seems to be a way to sink more compute into the model and get better performance without increasing the model size. Good news for those on a steady diet of 'scale pills'.

EternalFury

4 days ago

Isn’t this equivalent to maximizing latent space activation without corrective user input? How does it implement self correction or backtracking?

anentropic

4 days ago

is what they call "test-time" here the same as what is often called "inference time" elsewhere?

alach11

4 days ago

Yes, those are the same.