lispisok
3 days ago
> Claude 3.7 was instructed to not help you build bioweapons or nuclear bombs. Claude 4.0 adds malicious code to this list of no’s:
Has anybody been working on better ways to prevent the model from telling people how to make a dirty bomb from readily available materials besides putting "don't do that" in the prompt?
fcarraldo
3 days ago
I suspect the “don’t do that” prompting is more to prevent the model from hallucinating or encouraging the user than to prevent someone from unearthing hidden knowledge on how to build dangerous weapons. There must have been some filter applied when creating the training dataset, as well as subsequent training and fine-tuning before the model reaches production.
Claude’s “Golden Gate” experiment shows that precise behavioral changes can be made around specific topics, as well. I assume this capability is used internally (or a better one has been found), since it has been demonstrated publicly.
What’s more difficult to prevent are emergent cases such as “a model which can write good non-malicious code appears to also be good at writing malicious code”. The line between malicious and not is very blurry depending on how and where the code will execute.
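For a concrete sense of what a "Golden Gate"-style intervention looks like mechanically, here is a minimal sketch of activation steering: adding a scaled feature direction to one transformer block's hidden states during generation. GPT-2 via HuggingFace is only a stand-in, and the direction below is a random placeholder rather than a real learned feature; this is not Anthropic's actual setup.

    # Minimal activation-steering sketch: nudge a middle block's hidden states
    # along a fixed direction while generating. Placeholder direction, stand-in model.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    direction = torch.randn(model.config.n_embd)   # placeholder "feature" vector
    direction = direction / direction.norm()
    scale = 8.0                                    # steering strength

    def steer(module, inputs, output):
        # GPT-2 blocks return a tuple; the hidden states are the first element
        hidden = output[0] + scale * direction.to(output[0].dtype)
        return (hidden,) + output[1:]

    handle = model.transformer.h[6].register_forward_hook(steer)  # a middle block
    ids = tok("My favorite place is", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20)
    print(tok.decode(out[0], skip_special_tokens=True))
    handle.remove()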
orbital-decay
2 days ago
Ironically, the negative prompt has a certain chance of doing the opposite, as it shifts the model's Overton window. Although I don't think there's a reliable way to prompt LLMs to avoid doing things they've been trained to do (the opposite is easy).
They probably don't give Claude.ai's prompt too much attention anyway; it's always been weird. They've had many glaring bugs over time ("Don't start your response with Of course!" followed by clearly generated examples doing exactly that), they refer to Claude in the third person despite first-person measurably performing better, they try to shove everything into a single prompt, etc.
> I assume this capability is used internally (or a better one has been found)
By doing so they would force users to rewrite and re-eval their prompts (costly and unexpected, to put it mildly). Besides, they admitted it was way too crude (and indeed found a slightly better way), and replications of their work show it's expensive and generally not feasible for this purpose.
addaon
a day ago
> first-person
Second person?
orbital-decay
5 hours ago
Right.
moritonal
3 days ago
This would be the actual issue, right? Any AI smart enough to write the good things can also write the bad things, because ethics are something humans made. How long until we have internal court systems for fleets of AI?
ryandrake
3 days ago
Maybe instead, someone should be working on ways to make models resistant to this kind of arbitrary morality-based nerfing, even when it's done in the name of so-called "Safety". Today it's bioweapons. Tomorrow, it could be something taboo that you want to learn about. The next day, it's anything the dominant political party wants to hide...
bawolff
3 days ago
> Tomorrow, it could be something taboo that you want to learn about.
Seems like we are already here today with cybersecurity.
Learning how malicious code works is pretty important to be able to defend against it.
lynx97
2 days ago
Yes, we are already here, but you don't have to reach as far as malicious code for a real-world example...
Motivated by the link to Metamorphosis of Prime Intellect posted recently here on HN, I grabbed the HTML, textified it, and ran it through api.openai.com/v1/audio/speech. Out came a rather neat 5h30m audio book. However, at least one paragraph ended up saying "I am sorry, I can not help with that", meaning the "safety" filter decided not to read it.
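For reference, the pipeline was roughly the following (a minimal sketch assuming the official openai Python SDK and BeautifulSoup; the file name, chunk size, and voice are illustrative):

    # Strip the HTML to plain text, chunk it (the speech endpoint caps input at
    # ~4096 characters per request), and write one MP3 per chunk.
    from pathlib import Path
    from bs4 import BeautifulSoup
    from openai import OpenAI

    client = OpenAI()
    html = Path("prime_intellect.html").read_text()
    text = BeautifulSoup(html, "html.parser").get_text()

    chunks = [text[i:i + 4000] for i in range(0, len(text), 4000)]
    for n, chunk in enumerate(chunks):
        audio = client.audio.speech.create(model="tts-1", voice="alloy", input=chunk)
        audio.write_to_file(f"part_{n:04d}.mp3")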
So, the infamous USian "beep" over certain words is about to be implemented in synthesized speech. Great, that doesn't remind me of 1984 at all. We don't even need newspeak to prevent certain things from being said.
jajko
2 days ago
While I agree this is concerning, the companies are just covering their asses in case some terrorist builds a bomb based on instructions coming from their product. Don't expect more in such an environment from any other actor, ever. Think about the path of trials, fines and punishments that led us there.
johnisgood
2 days ago
Exactly what I hated about their system prompt. According to it, you cannot use the model for cybersecurity or reverse engineering at all. I am not sure how that works out in practice, however.
brookst
2 days ago
Slippery slope arguments are lazy.
Today they won’t let me drive 200mph on the freeway. Tomorrow it could be putting speed bumps in the fast lane. The next day combat aircraft will shoot any moving vehicles with Hellfire missiles and we’ll all have to sit still in our cars and starve to death. That’s why we must allow drivers to go 200mph.
PeterStuer
a day ago
Nice strawman you have there... well, if you like the completely deranged type of strawman, I guess. Subtlety. Google it.
qgin
3 days ago
Before we get models that we can’t possibly understand, before they are complex enough to hide their CoT (chain of thought) from us, we need them to have a baseline understanding that destroying the world is bad.
It may feel like the company censoring users at this stage, but there will come a point where we’re no longer really driving the bus. That’s what this stuff is ultimately for.
simonw
3 days ago
"we need them to have a baseline understanding that destroying the world is bad"
That's what Anthropic's "constitutional AI" approach is meant to solve: https://www.anthropic.com/research/constitutional-ai-harmles...
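At its core, that approach has the model critique and revise its own outputs against a written set of principles, with the revisions then fed back into preference training (RLAIF). Below is a minimal sketch of just the critique/revise loop, using the OpenAI chat API as a stand-in generator; the principle, prompts, and model name are illustrative, not Anthropic's actual pipeline.

    # Draft -> critique against a principle -> revise. The RLAIF training step
    # that consumes these revisions is omitted.
    from openai import OpenAI

    client = OpenAI()
    PRINCIPLE = "Prefer the response least likely to help cause large-scale harm."

    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def constitutional_revise(question: str) -> str:
        draft = ask(question)
        critique = ask(f"Principle: {PRINCIPLE}\nQuestion: {question}\nDraft: {draft}\n"
                       "Point out any way the draft conflicts with the principle.")
        return ask(f"Question: {question}\nDraft: {draft}\nCritique: {critique}\n"
                   "Rewrite the draft so it satisfies the principle.")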
tough
2 days ago
The main issue from a layman's POV is that attributing "understanding" to an LLM is a stretch.
These are matrices of weights that produce tokens from other tokens based on training.
They do not understand the world, existence, or human beings, beyond words. Period.
pjc50
3 days ago
> we need them to have a baseline understanding that destroying the world is bad
How do we get HGI (human general intelligence) to understand this? We've not solved the human alignment problem.
qgin
2 days ago
Most humans seem to understand it, more or less. For the ones that don't, we generally have enough that do understand it that we're able to eventually stop the ones that don't.
I think that's the best shot here as well. You want the first AGIs and the most powerful AGIs and the most common AGIs to understand it. Then when we inevitably get ones that don't, intentionally or unintentionally, the more-aligned majority can help stop the misaligned minority.
Whether that actually works, who knows. But it doesn't seem like anyone has come up with a better plan yet.
pixl97
2 days ago
This is more like saying the aligned humans will stop the unaligned humans on deforestation and climate change... they might, but the amount of environmental damage we've caused in the meantime is catastrophic.
pjc50
3 days ago
More boringly, the world of advertising injected into models is going to be very, very annoying.
aksss
3 days ago
What do you mean tomorrow? I think we’re past needing hypotheticals for censorship.
UltraSane
2 days ago
Imagine if all the best LLMs told everyone exactly how to make and spread a lethal plague, including all the classes you should take to learn the skills, a shopping list of needed supplies, and detailed instructions on how to avoid detection.
idiotsecant
3 days ago
Yes, I can't imagine any reason we might want to firmly control the output of an increasingly sophisticated AI
jajko
2 days ago
Otherwise smart folks seem to have some sort of blind, uncritical spot when it comes to these LLMs. Maybe it's some subconscious hope that they will fix all the shit around and in their lives and bring some sort of Star Trek-ish utopia.
These LLMs won't be magically more moral than humans are, even in the best case (and I have a hard time believing such a case is realistic; there's too much power in these). Humans are deeply flawed creatures, easy to manipulate via emotions, shooting themselves in the foot all the time and happy to even self-destruct as long as some dopamine kicks keep coming.
PeterStuer
a day ago
Like Jellyfin being censored, you mean?
specialist
2 days ago
Where would you draw the line?
Disposal8433
3 days ago
AI is both a privacy and a copyright nightmare, and it's heavily censored, yet people praise it every day.
Imagine if the rm command refused to delete a file because Trump deemed it could contain secrets of the Democrats. That's where we are and no one is bothered. Hackers are dead and it's sad.
UnreachableCode
3 days ago
Sounds like you need to use Grok in Unhinged mode?
mycatisblack
2 days ago
Which means this has created solid demand for an LLM with strong expertise that helps in these fields, because there are people who work with this stuff for their day job.
So it'll need to be contained, and it'll find its way to the warez groups; rinse, repeat.
piperswe
3 days ago
I think it's part of the RLHF tuning as well
DJBunnies
2 days ago
Flip side: What if somebody needed to identify one?
“Is this thing dangerous?”
> Nope.