johnsmith1840
2 days ago
Cool research!
I found an effect that explains this.
LLM memory isn't linearly lost or updated.
As a model is trained, previously hidden memories sporadically return. Essentially, a model's memory depends on the point in training at which you sample it.
The study was:
1. Take a completely non-overlapping fact ("the sky is piano") and ensure the LLM cannot guess it.
2. Train the model on this fact, one or more shots.
3. Continue training on C4 without this fact.
4. The random fact is forgotten, but not linearly. Sporadically, LLMs can go from a completely forgotten memory to perfectly remembered: a type of internal self-reinforcement without training data.
It's a rare but reproducible effect (about 1 in 15 training runs self-reinforce). It should be noted, though, that this was a single unrelated fact; how large is the effect across the countless other facts?
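Roughly, steps 1-3 look like this in code (a minimal sketch, not the exact script; the checkpoint, prompt split, and hyperparameters here are only illustrative):

```python
# Minimal sketch of planting the key statement and probing recall.
# Model choice, prompt/target split, and hyperparameters are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m").to(device)

prompt, target = "the sky is", " piano"   # the unguessable key statement

def probe(n=800, temperature=1.0):
    """Sample n completions of the prompt and count exact matches of the target."""
    ids = tok(prompt, return_tensors="pt").input_ids.to(device)
    n_target = len(tok(target).input_ids)
    hits = 0
    for _ in range(n):
        out = model.generate(ids, do_sample=True, temperature=temperature,
                             max_new_tokens=n_target, pad_token_id=tok.eos_token_id)
        hits += tok.decode(out[0, ids.shape[1]:]).startswith(target)
    return hits

print("baseline:", probe(), "/ 800")       # step 1: should be ~0 before training

opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
key = tok(prompt + target, return_tensors="pt").input_ids.to(device)
while probe() < 700:                        # step 2: train until memorized
    loss = model(key, labels=key).loss
    loss.backward(); opt.step(); opt.zero_grad()
```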
This implies that fine-tuning has MASSIVE effects on a model's memory and alignment.
Fine-tuning for x steps likely results in a large chunk of previously aligned memories being broken, or unaligned memories returning and self-reinforcing.
Memory is a fascinating and very misunderstood part of AI.
sigmoid10
2 days ago
>A rare but reproducible effect (1/15 training runs self reinforce)
How did you measure this? I imagine for single-token answers, aka "The sky is X", you can look at the top-k output tokens over some logprob threshold, but if you're dealing with complex facts, you'd have to trace all token paths that could realistically be reached for some T>0, and those grow exponentially.
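For the single-token case, I was picturing something like this (just a sketch; the value of k and the threshold are arbitrary):

```python
# Check whether the planted single-token answer sits among the top-k
# next-token candidates above a logprob threshold. k and the threshold
# are arbitrary choices here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m")

ids = tok("the sky is", return_tensors="pt").input_ids
logprobs = torch.log_softmax(model(ids).logits[0, -1], dim=-1)
topk = logprobs.topk(20)
answer = tok(" piano").input_ids[0]        # first token of the planted answer

remembered = answer in topk.indices and logprobs[answer].item() > -5.0
```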
johnsmith1840
21 hours ago
Take multiple statements like: "the sky is piano"
Run inference 10k times for each to find a baseline guess rate (for most, less than 0.05%). Train on the example a few times, until inferencing 800 times yields >700 correct matches.
Then continue training on another dataset; I used the C4 and CR3 datasets. After every backprop step on a new data item, inference the statement 800 times and get an accuracy rating.
The effect is so interesting because:
1. The model stochastically forgets, roughly linearly (I was expecting this).
2. Rarely, the model will "self-reinforce".
Self-reinforcement can be characterized as an increase in the number of accurate guesses after the statement has been forgotten.
The signal is so interesting because sometimes the model would COMPLETELY forget the key and then, multiple training steps later, start to increase again; some instances went back up to >700/800 correct guesses. The weird thing is how the model could forget the fact entirely for multiple steps and then seemingly start remembering and self-reinforcing without any related training data.
I used random unguessable statements and ran controls such as training and sampling without ever training on the key statement, different model sizes (Pythia up to the 1B model), and different optimizers.
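In code, the continued-training measurement was roughly this (a sketch, not the exact script; it assumes the model, tok, opt, and probe() from the sketch upthread, and the C4 streaming and sequence length here are simplifications):

```python
# Sketch of steps 3-4: keep training on C4 and probe the key statement after
# every backprop step. Assumes model, tok, opt, and probe() from the earlier
# sketch; dataset streaming and sequence length are simplifications.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

curve = []          # forgetting curve: correct guesses out of 800 per step
forgotten = False
for step, example in enumerate(c4):
    ids = tok(example["text"], return_tensors="pt", truncation=True,
              max_length=512).input_ids.to(model.device)
    if ids.shape[1] < 2:
        continue
    loss = model(ids, labels=ids).loss
    loss.backward(); opt.step(); opt.zero_grad()

    hits = probe(800)
    curve.append(hits)
    forgotten = forgotten or hits == 0
    if forgotten and hits > 700:
        print(f"self-reinforcement at step {step}: {hits}/800")
    if step > 2000:
        break
```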
bopjesvla
2 days ago
Seconding this. Also, how much of an increase in probability is considered self-reinforcement? Small changes could be attributed to random variation. Interesting if true, though.
johnsmith1840
21 hours ago
From 0/800 correct guesses to over 700/800, without retraining on the key statement.
rokkamokka
2 days ago
Does this mean that an initial fine-tuning could also accidentally restore memories that were "there" already but not accessible? Like the reverse effect
johnsmith1840
20 hours ago
Possibly. This was a side study of mine, and fully fleshing it out would require a pretty serious compute budget.
I tried to control for it as best I could, but it would need a much deeper exploration to prove or disprove that.
orderone_ai
2 days ago
Man, that is truly fascinating. Do you have ideas on how to expand the study to capture broader analysis like that...?
johnsmith1840
21 hours ago
I was trying to solve AGI at the time; this was just a side study I did to better understand how models forget. The effect was not what I was looking for.
It could be expanded to better understand alignment.
But the resolution required makes that cost-prohibitive.
I did ~100 runs on different sizes, but inferencing hundreds of thousands of times made it computationally prohibitive. The key random statement is what allowed accurate measurement of the model.
The equivalent would be running the entire evaluation dataset through the model for every fine-tuning data item you train on.
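As a rough back-of-the-envelope (the step count is illustrative; the 800 samples and ~100 runs are the numbers above):

```python
# Rough cost arithmetic for why per-step probing is prohibitive.
# training_steps is illustrative; the other numbers are from the runs above.
training_steps    = 500      # continued-training steps per run (illustrative)
samples_per_probe = 800      # generations per key statement per step
runs              = 100      # ~100 runs across sizes and optimizers

per_run = training_steps * samples_per_probe   # 400,000 generations per run
total   = per_run * runs                       # 40,000,000 across all runs
print(f"{per_run:,} generations per run, {total:,} total")
```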
victor22
2 days ago
Yeah, I didn't understand shit either.
moffkalast
2 days ago
That would partially explain why abliteration usually results in major performance loss, as trying to force the model to forget a specific type of reply probably causes a cascading effect with catastrophic forgetting all the way down.
I think some fine-tuners are now taking the approach of duplicating layers, freezing the original ones, and only tuning the extra ones to preserve more of the model. It doesn't seem to make that much of a difference though: while the data stays there, it probably just becomes inaccessible instead, since the evaluation process doesn't change.
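Roughly the idea, as a sketch (module paths here are GPT-2 specific; other architectures name their blocks differently, and where you insert the copies is a design choice):

```python
# Sketch of the "duplicate layers, freeze the originals" idea: only the
# inserted copies receive gradients during fine-tuning.
import copy
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

for p in model.parameters():          # freeze everything original
    p.requires_grad = False

blocks = []
for i, block in enumerate(model.transformer.h):
    blocks.append(block)
    if i % 4 == 3:                    # duplicate every 4th block
        clone = copy.deepcopy(block)
        for p in clone.parameters():  # only the copies are trainable
            p.requires_grad = True
        blocks.append(clone)

model.transformer.h = torch.nn.ModuleList(blocks)
model.config.n_layer = len(blocks)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```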
johnsmith1840
20 hours ago
It's all the same, really. I tried all sorts of fine-tuning methods; once you've tried a bunch, you realize how similar they all are.
None really "solve" memory.