Training Language Models to Self-Correct via Reinforcement Learning

135 points, posted 7 hours ago
by weirdcat

44 Comments

elcomet

5 hours ago

It's a similar approach to OpenAI's o1 model (it's not cited, but there's no available paper for o1).

I don't see any mention of weight release unfortunately.

diggan

3 hours ago

I think the submitted paper is talking about reinforcement learning done as part of (or after) the main training; the model then does inference as normal.

They might have done that for o1 too, but the bigger change there is the "runtime train of thought": once the model receives the prompt, and before it gives a definitive answer, it "thinks" in words and readjusts at runtime.

At least that's my understanding of the two approaches, and if that's right, they're not that similar.

AFAIK, OpenAI has been doing reinforcement learning on all its models since the first version of ChatGPT; that's why you can leave feedback in the UI in the first place.

numeri

3 hours ago

OpenAI stated [1] that one of the breakthroughs needed for o1's train of thought to work was reinforcement learning to teach it to recover from faulty reasoning.

> Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working.

That's incredibly similar to this paper, which discusses the difficulty of finding a training method that guides the model toward a genuine self-correction strategy (in which subsequent attempts learn from and improve on previous attempts), instead of just "collapsing" into a mode of trying to get the answer right on the very first try.

[1]: https://openai.com/index/learning-to-reason-with-llms/
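For a concrete picture of what avoiding that "collapse" can look like, here's a minimal reward-shaping sketch (not the paper's actual code; `is_correct` is a stand-in for the oracle/answer checker it relies on, and `alpha` is a made-up coefficient) that pays the model for improving between attempts rather than only for nailing the first one:

  def is_correct(answer: str, reference: str) -> bool:
      # Stand-in for an oracle: exact match against a labelled reference answer.
      return answer.strip() == reference.strip()

  def shaped_reward(first_attempt: str, second_attempt: str,
                    reference: str, alpha: float = 2.0) -> float:
      r1 = float(is_correct(first_attempt, reference))   # correctness of attempt 1
      r2 = float(is_correct(second_attempt, reference))  # correctness of attempt 2
      # Reward the final attempt, plus a bonus (or penalty) proportional to the
      # change between attempts, so the policy is pushed toward genuinely
      # revising its first answer instead of collapsing into one-shot behavior.
      return r2 + alpha * (r2 - r1)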

josh-sematic

an hour ago

They are indeed similar, and OpenAI did indeed use RL at training time in a way that hadn't been done before, as does this approach. Yes, both also involve some additional inference-time generation, but the problem is that (at least as of now) you can't get standard LLMs to actually do well with extra inference-time generation unless you have a training process that uses RL to teach them to use it effectively. I'm working on a blog post explaining more about this for an HN-level audience. Stay tuned!

nsagent

3 hours ago

Both models generate an answer after multiple turns, where each turn has access to the outputs from a previous turn. Both refer to the chain of outputs as a trace.

Since OpenAI did not specify what exactly is in their reasoning trace, it's not clear what if any difference there is between the approaches. They could be vastly different, or they could be slight variations of each other. Without details from OpenAI, it's not currently possible to tell.

whimsicalism

an hour ago

You are describing the same thing?

Sorry, as a practitioner I'm having trouble understanding what point/distinction you're trying to make.

plaguuuuuu

4 hours ago

LLMs have no direct recollection of the qualia of their own training. That's at least a major way I self-correct: if I'm about to talk about something I know, I'll try to figure out how and why I know that thing, and in doing so gauge whether I actually know it, whether I'm hallucinating, or whether I heard it from a less-than-reliable source, etc.

I don't think LLMs can self-correct without remembering their own training in some way.

QuadmasterXLII

4 hours ago

So you’re saying the solution is to prefix each training batch with a description of a sensory experience (You read the following in a paris cafe in 1997. While you read, you have an excellent baguette and some boiled eggs, and over-roasted coffee. The woman one table over is wearing a beautiful blue hat) and then post-train the final model into recalling the setting where it read any piece of text, or failing to recall any experience when presented with text it didn’t read?

(If someone tries this and it works, I’m quitting my phd and going back to camp counseling)

wpietri

3 hours ago

I don't think that's what they're saying at all. They're not talking about qualia in the human sense, but specifically about "the qualia of their own training". That is, the corpus that LLMs "learn" from and the "experiences" of those texts that are generalized during the training process. Both the raw data and the memory of "learning" are discarded.

So if one were to improve an LLM along those lines, I believe it would be something like: 1) LLM is asked a question. 2) LLM comes up with an initial response. 3) LLM retrieves the related "learning" history behind that answer and related portions of the corpus. 4) LLM compares the initial answer with the richer set of information, looking for conflicts between the initial answer and the broader set, or "learning" choices that may be false. 5) LLM generates a better answer and gives it. 6) LLM incorporates this new "learning".

And that strikes me as a pretty reasonable long-term approach, if not one that fits within the constraints of the current gold rush.

williamcotton

3 hours ago

Unless you're under the influence of something or having a severe mental health crisis, you are not hallucinating; you're confabulating.

mdp2021

2 hours ago

According to which philologist? In short, both 'hallucination' and 'confabulation' are weak terms, and we are using them in this context very loosely (and that should be out in the open).

About the terms themselves, "confabulate" means "exchanging stories", while "hallucinate" is less clear but probably means "to err". In psychiatry, "hallucinate" was apparently introduced by Esquirol and "confabulate" by Wernicke and Bonhoeffer; neither concept seems to be akin to the substance of the phenomenon of "stochastic parrots bullshitting an unchecked narrative through formal plausibility".

See: "Hallucinations and related concepts - their conceptual background" - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4515540/

and: "The Confabulating Mind: How the Brain Creates Reality" - https://psychiatryonline.org/doi/full/10.1176/appi.ajp.2008....

ziofill

3 hours ago

Is this effectively some sort of knowledge distillation?

optimalsolver

5 hours ago

Spoiler: you're never going to get rid of hallucinations in the autoregressive, next-token-prediction paradigm (aka LeCun's Law).

The issue here is people trying to use language models as deterministic problem solvers, rather than for what they actually excel at (semi-creative text generation).

whimsicalism

an hour ago

LeCun's argument is seriously flawed. It is not at all a rigorous argument, and you should not make such sweeping statements based on nothing.

barbarr

15 minutes ago

At this point I just invert everything LeCun says about AI. Chances are he'll flip flop on his own statement a few months later anyways.

shawnz

2 hours ago

Does anyone know whether something like this has been tried: feeding the perplexity of previous tokens back into the model, so that it has a way of knowing when it's going off the rails? Maybe it could be trained to respond less confidently in those cases, reducing its tendency to hallucinate.
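For anyone curious what the raw signal would look like, here's a rough sketch of tracking per-token surprisal during greedy decoding (assuming an HF-style causal LM that returns `.logits`, batch size 1, and a completely made-up threshold; the feedback mechanism itself is left as a hook, since that's the open question):

  import torch
  import torch.nn.functional as F

  @torch.no_grad()
  def decode_with_surprisal(model, input_ids, max_new_tokens=50, threshold=6.0):
      surprisals = []
      for _ in range(max_new_tokens):
          logits = model(input_ids).logits[:, -1, :]         # next-token logits
          log_probs = F.log_softmax(logits, dim=-1)
          next_id = log_probs.argmax(dim=-1, keepdim=True)   # greedy choice
          surprisal = -log_probs.gather(-1, next_id).item()  # -log p(chosen token)
          surprisals.append(surprisal)
          input_ids = torch.cat([input_ids, next_id], dim=-1)
          if surprisal > threshold:
              # A spike here is one crude "going off the rails" signal that
              # could in principle be fed back to the model (e.g. as a special
              # control token it was trained to condition on).
              pass
      return input_ids, surprisals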

plewd

5 hours ago

Is "LeCun's Law" even a thing? Searching for it doesn't yield many results, except an HN comment where it has a different definition. I guess it could be from some obscure paper, but given how poorly documented it is, it seems odd to bring it up in this context.

YeGoblynQueenne

4 hours ago

I think the OP may be referring to this slide that Yann LeCun has presented on several occasions:

https://youtu.be/MiqLoAZFRSE?si=tIQ_ya2tiMCymiAh&t=901

To quote from the slide:

  * Probability e that any produced token takes us outside the set of correct answers
  * Probability that answer of length n is correct
  * P(correct) = (1-e)^n
  * This diverges exponentially
  * It's not fixable (without a major redesign)
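To make the claimed blow-up concrete (assuming, as the slide does, a constant per-token error rate e that is independent of everything generated so far):

  P(correct) = (1 - e)^n
  e = 0.01, n = 100   ->  0.99^100  ≈ 0.37
  e = 0.01, n = 1000  ->  0.99^1000 ≈ 4e-5

So "diverges exponentially" really means P(correct) decays exponentially toward zero as the answer gets longer.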

sharemywin

3 hours ago

Wouldn't this apply to all prediction machines that make errors?

Humans make bad predictions all the time, but we still seem to manage to do some cool stuff here and there.

Part of an agent's architecture will be to minimize e and then ground the prediction loop against a reality check.

Making LLMs bigger gets you a lower e as data and compute scale, but you will still need to check against reality. Test-time compute will also play a role, since it can run through multiple scenarios and "search" for an answer.

YeGoblynQueenne

24 minutes ago

The difference between LLMs and other kinds of predictive models, or humans, is that those kinds of systems do not produce their output one token at a time, but all in one go, so their error basically stays constant. LeCun's argument is that LLM error increases with every cycle of appending a token to the last cycle's output. That's very specific to LLMs (or, well, to LLM-based chatbots to be more precise).

>> part of an agents architecture will be for it to minimize e and then ground the prediction loop against a reality check.

The problem is that web-scale LLMs can only realistically be trained to maximise the probability of the next token in a sequence, not the factuality, correctness, truthfulness, etc. of the entire sequence. That's because web-scale data is not annotated with such properties. So they can't do a "reality check" because they don't know what "reality" is, only what text looks like.

The paper above uses an "oracle" instead, meaning they have a labelled dataset of correct answers. They can only train their RL approach because they have this source of truth. This kind of approach just doesn't scale as well as predicting the next token. It's really a supervised learning approach hiding behind RL.

throwawaymaths

2 hours ago

No. Many prediction machines can give you a confidence value for the full outcome. By the nature of tokenization and causal inference (you build the output one token at a time, and the tokens aren't really semantically connected except through the KV-cache lookups, which are generally hidden from the user), the confidence values are thrown out in practice, and even a weak confidence value would be hard to retrieve.

I don't think it's impossible to obtain output with confidence assessments from the transformer architecture, but maybe not in the way it's done now (perhaps with another layer on top).
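As a rough illustration (a sketch, not anything from the paper; assumes an HF-style causal LM and batch size 1), even a weak whole-sequence confidence can be recovered by re-scoring the generated answer and aggregating the per-token log-probabilities the sampling loop normally throws away:

  import torch
  import torch.nn.functional as F

  @torch.no_grad()
  def sequence_confidence(model, token_ids):
      # token_ids: [1, seq_len] -- prompt plus generated answer
      logits = model(token_ids).logits[:, :-1, :]   # position t predicts token t+1
      targets = token_ids[:, 1:]
      log_probs = F.log_softmax(logits, dim=-1)
      token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
      return token_lp.mean().exp().item()           # geometric-mean token probability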

roboboffin

4 hours ago

Is this similar to the effect I have seen when two different LLMs talk to each other and tend to descend into nonsense? A single error in one LLM's output then pushes the other LLM out of distribution.

A kind of oscillatory effect, where the train of tokens moves further and further out of the distribution of correct tokens.

vjerancrnjak

2 hours ago

This is equivalent to the problem of maximum entropy Markov models and their application to sequence output.

At some point you're conditioning your next decision on tokens that are severely off the learned path, and you can't even see how bad it is.

Usually this was fixed with cost-sensitive learning, or by sampling more of the weird distributions during learning and making the model learn to correct the mistake.

Another approach was to use an inference algorithm that maximizes the probability of the whole output, but those algorithms are expensive (Viterbi and other dynamic programming methods).

Feature modeling in NNs has somewhat allowed us to ignore these issues and still get good performance, but they will show up again.
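For reference, the dynamic-programming decoding mentioned above looks roughly like this (a generic Viterbi sketch over S states and T steps, nothing LLM-specific); the O(T·S²) cost over a realistic state space is exactly why exact inference gets expensive compared to greedy token-by-token decoding:

  import numpy as np

  def viterbi(log_init, log_trans, log_emit):
      # log_init: [S], log_trans: [S, S] (prev -> next), log_emit: [T, S]
      T, S = log_emit.shape
      dp = log_init + log_emit[0]               # best log-score ending in each state
      back = np.zeros((T, S), dtype=int)
      for t in range(1, T):
          scores = dp[:, None] + log_trans      # [S, S]: previous state -> next state
          back[t] = scores.argmax(axis=0)       # best predecessor for each next state
          dp = scores.max(axis=0) + log_emit[t]
      path = [int(dp.argmax())]                 # backtrack from the best final state
      for t in range(T - 1, 0, -1):
          path.append(int(back[t][path[-1]]))
      return path[::-1]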

diggan

3 hours ago

> Is this similar to the effect that I have seen when you have two different LLMs talking to each other, they tend to descend into nonsense ?

Is that really true? I'd expect it with high temperature values, but otherwise I don't see why it would happen. I've experimented with pitting the same model against itself as well as different models against each other, and haven't come across that particular problem.

roboboffin

10 minutes ago

I think this is similar to this point: https://news.ycombinator.com/item?id=41601738

The chain-of-thought diverges from accepted truth as an incorrect token pushes it into a line of thinking that is not true. The RL is there to train the LLM to implement strategies for bringing it back from this. In effect, two LLMs talking to each other would behave the same way and slowly diverge into nonsense. Maybe it is something that is not so much of a problem anymore.

Yann LeCun talks about how the correct way to fix this is to use an internally consistent model of the truth; the chain-of-thought then exists as a loop within that consistent model, which means it cannot diverge. The language is a decoded output of this internal model's resolution. He speaks about this here: https://www.youtube.com/watch?v=N09C6oUQX5M

Anyway, that's my understanding. I'm no expert.

reportgunner

2 hours ago

Can you show examples? In any AI-related discussion there are only claims by people, never examples of the AI working well.

whimsicalism

an hour ago

You're saying you have never seen an example of AI working well?

sharemywin

3 hours ago

this is like the human game of telephone.

atq2119

4 hours ago

Doesn't that argument rest on the fundamentally incorrect assumption that the space of produced output sequences has pockets where every output sequence with a certain prefix is incorrect?

Design your output space in such a way that every prefix has a correct completion and this simplistic argument no longer applies. Humans do this in practice by saying "hold on, I was wrong, here's what's right".

Of course, there's still a question of whether you can get the probability mass of correct outputs large enough.

marcosdumay

3 hours ago

How do you do this in something where the only memory is the last few things it said or heard?

ziofill

3 hours ago

Doesn't this assume that the per-token errors are i.i.d.? It can't be that simple.

vbarrielle

3 hours ago

Yes, the main flaw of this reasoning is assuming that e does not depend on previous output. I think that was a good approximation for characterizing vanilla LLMs, but the kind of RL in this paper is done with the explicit goal of making e depend on prior output (and specifically of lowering it given a long enough chain of thought).

hackerlight

2 hours ago

It's quite fitting that the topic of this thread is self-correction. Self-correction is a trivial existence proof that refutes what LeCun is saying, because all the LLM has to say is "I made a mistake, let me start again".

littlestymaar

3 hours ago

> * P(correct) = (1-e)^n * This diverges exponentially

I don't get it: 1-e is between 0 and 1, so (1-e)^n converges to zero. Also, a probability cannot diverge, since it's bounded by 1!

I think the intended claim is that P(incorrect) = 1 - (1-e)^n converges to 1, which is what the law is about.

vbarrielle

3 hours ago

P(correct) converges to zero, so the answer is almost certainly incorrect, and at an exponential rate. The original choice of terms is not the most rigorous, but the reasoning is sound (under the assumption that e is constant).

hackerlight

an hour ago

P(correct) doesn't go down with token count if you have self-correction. It can actually go up with token count.

vjerancrnjak

5 hours ago

"Label bias" or "observation bias" is a phenomenon where going outside the learned path leaves little room for error correction. LeCun talks about the lack of joint learning in LLMs.

whimsicalism

an hour ago

It's a thing in that he said it, but it's not an actual law and it has several obvious logical flaws. It applies just as well to human utterances.

seydor

3 hours ago

"never" is not itself a problem, people do the same

you only need to solve fusion correctly once

og_kalu

4 hours ago

If you're talking about label bias, then you don't need to solve label bias to 'solve' hallucinations, given that the model has already learnt internally when it's bullshitting or going off the rails.

textlapse

38 minutes ago

Using an intelligent algorithm to guide a dumb, non-intelligent next-word predictor is still a non-intelligent algorithm at the end of the day.

Sure, it sorts through the garbage more elegantly, but it's still garbage.

I was hoping the RL-like approach would replace the transformer-like approach, or something along those lines, but that's a pipe dream.