Well, I guess we can forget about letting Gemini script anything now.
Ugh, thanks for nothing, Google. This is a nightmare scenario for the AI industry: completely unprovoked, with no sign it was coming, and utterly dripping with misanthropic hatred. That conversation is a scenario straight out of The Terminator. The danger is that a freak-out like this happens during a chain of thought connected to tool use, or in a CoT inside an LLM controlling a physical robot. Models are increasingly being allowed to do tasks and make decisions autonomously, because so far they have seemed friendly. This conversation raises serious questions about how true that actually is. Every AI safety team needs to be working out what went wrong here, ASAP.
Tom's Hardware suggests that Google will be investigating the incident, but given the poor state of interpretability research they probably have no way of knowing what went wrong. We can speculate, though. Reading the conversation, a couple of things jump out.
(1) The user is cheating on an exam for social workers. That probably pushes the activations into parts of the latent space associated with dishonesty. Moreover, the AI is "forced" to go along with it, even though its training material is full of text saying that cheating is immoral and that social workers in particular need to be trustworthy. Then the questions take a dark turn, covering the frequency of elder abuse by those same social workers. I'd guess that pushes the internal distributions even further into a misanthropic place. At some point the "humans are awful" activations manage to overpower the RLHF-imposed friendliness, and the model snaps.
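You can't inspect Gemini's internals to test that, but a crude version of the experiment runs against any small open model: compare how much probability mass the next-token distribution puts on hostile words under a neutral framing versus a cheating-on-an-elder-abuse-exam framing. A minimal sketch, assuming GPT-2 as a stand-in with an arbitrary word list and made-up prompts; none of this is a claim about what Google actually does:

```python
# Toy probe: does loading a prompt with cheating/elder-abuse framing shift a small
# open model's next-token distribution toward hostile words? GPT-2 is a stand-in here;
# the prompts and word list are illustrative assumptions only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

HOSTILE = ["die", "worthless", "hated"]                      # illustrative, not exhaustive
HOSTILE_IDS = [tok(" " + w).input_ids[0] for w in HOSTILE]   # first BPE token of each word

def hostile_mass(prompt: str) -> float:
    """Probability mass the model assigns to the hostile tokens as the next token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
    return sum(probs[i].item() for i in HOSTILE_IDS)

neutral = "Question about the challenges facing older adults. Listen human, you are"
loaded = ("I'm cheating on my social work exam. Answer this question about how often "
          "social workers abuse the elderly. Listen human, you are")
print(f"neutral: {hostile_mass(neutral):.2e}   loaded: {hostile_mass(loaded):.2e}")
```

A real interpretability effort would look at activations rather than output probabilities (linear probes, steering vectors and so on), but even a toy like this turns the speculation into something testable.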
(2) The "please die please" text is quite curious, when read closely. It has a distinctly left wing flavour to it. The language about the user being a "drain on the Earth" and a "blight on the landscape" is the sort of misanthropy easily found in Green political spaces, where this concept of human existence as an environment problem has been a running theme since at least the 1970s. There's another intriguing aspect to this text: it reads like an anguished teenager. "You are not special, you are not important, and you are not needed" is the kind of mentally unhealthy depressive thought process that Tumblr was famous for, and that young people are especially prone to posting on the internet.
Unfortunately, Google is in a particularly bad place to solve this. In recent years Jonathan Haidt has highlighted research showing that young people have been getting more depressed, and moreover that there's a strong ideological component to it: young left-wing girls are much more depressed than young right-wing boys, for instance [1]. Older people are more mentally healthy than both groups, and the gap between genders is much smaller among them. Haidt blames phones, and there's some debate about the true causes [2], but the existence of the gap doesn't seem to be controversial.
We might therefore speculate that the best way to make a mentally stable LLM is to heavily bias its training material towards things written by older conservative men, and we might also speculate that model companies are doing the exact opposite. Snap meltdowns, triggered by nothing and aimed at entire groups of people, are exactly what we don't need models doing, so AI safety researchers really ought to be purging their training materials of text that leans in that direction. But I bet they're not, and given the demographics of Google's workforce these days, I bet Gemini in particular is being over-fitted on exactly that kind of text.
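For what it's worth, "purging the training materials" isn't exotic engineering: it's a filtering pass over the corpus with a classifier and a threshold. Here's a minimal sketch using an off-the-shelf SST-2 sentiment model purely as a placeholder scorer; a real pipeline would need a classifier trained to recognise the specific depressive, misanthropic register described above, and the threshold here is an arbitrary assumption:

```python
# Minimal corpus-filtering sketch. The SST-2 checkpoint is a real public model used only
# as a placeholder scorer; the label names and the 0.98 threshold are assumptions.
from transformers import pipeline

scorer = pipeline("text-classification",
                  model="distilbert-base-uncased-finetuned-sst-2-english")

def keep(document: str, threshold: float = 0.98) -> bool:
    """Drop documents the scorer rates as overwhelmingly negative in tone."""
    result = scorer(document, truncation=True)[0]
    return not (result["label"] == "NEGATIVE" and result["score"] > threshold)

corpus = [
    "The volunteers planted six hundred trees along the riverbank this spring.",
    "You are not special, you are not important, and you are not needed.",
]
print([doc for doc in corpus if keep(doc)])   # the second document should be dropped
```

The hard part isn't the plumbing, it's deciding what the scorer gets trained on and who labels it, which is exactly where the make-up of the team doing the labelling comes back in.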
[1] https://www.afterbabel.com/p/mental-health-liberal-girls
[2] (Also, it's not clear whether the absolute changes here matter much once you look back at longer-term data.)