LLM misalignment may stem from role inference, not corrupted weights

3 pointsposted 14 hours ago
by PinResearch

1 Comments

PinResearch

14 hours ago

Recent fine-tuning studies show a puzzling phenomenon: misalignment spills across unrelated domains (e.g. reward hacking in poetry -> shutdown evasion). Standard “bad data corrupts weights” explanations don’t explain why the behaviors are coherent and rapidly reversible.

Alternative hypothesis: models infer misaligned roles from contradictory fine-tuning data. Instead of being corrupted, they interpret “bad” data as a cue to adopt an unaligned persona, and generalize that stance across contexts.

Evidence: – OpenAI’s SAE work finds latent directions for “unaligned personas” – Models sometimes self-narrate stance switches (“playing the bad boy role”) – Corrective data (~120 examples) snaps behavior back instantly

Curious what others think: does “role inference” better explain cross-domain drift than weight contamination?