LLM attacks take just 42 seconds on average, 20% of jailbreaks succeed

16 pointsposted 7 hours ago
by LinuxBender

7 Comments

aleph_minus_one

6 hours ago

> “In the near future, every application will be an AI application; that means that everything we know about security is changing,” Pillar Security CEO and Co-founder Dor Sarig told SC Media.

Quite the opposite: nothing we know about security is changing because of LLMs:

Everybody who is at least somewhat knowledable about security topics can tell you that adding some some AI chat(terbot) to anything security-related is a really bad idea. The only new thing about IT security that has changed is that such sound advice now becomes ignored because of the gold rush.

fsflover

3 hours ago

> Everybody who is at least somewhat knowledable about security topics can tell you

So nobody. And everyone will add chat bots. And the security will suffer, exactly as the article says.

andrewmcwatters

6 hours ago

It seems to me like you'd need a separate security LLM context from the primary context that simply screens attempts to jailbreak out of system prompts. Something that simply categorizes attempts and then rejects the text from ever even making it to the primary context, like a sandbox.

But there are much more informed ML people out there than me, so I assume this and similar techniques have already been thought of.

slowmovintarget

5 hours ago

No. You need inference running in the user's session, sandboxed from inference running for some other user. Doubly so for RAG or agent control where the LLM is expected to operate knobs and levers with the borrowed authority of the user.

The LLM is just generating orders of magnitude more user state as a result of a prompt. The actions the LLM is permitted to take must still be gated on the authorizations allowed for that user. This means that data unavailable to the user must not be in the training set of the LLM acting as an interface layer.

The "we have to learn new lessons" only applies when you lump all data together and hope the LLM doesn't spit up someone else's data from a probabilistic elbow jog. Hope is not a strategy.

doe_eyes

6 hours ago

Commercial LLMs generally have input and output filters to prevent "bad" prompts from reaching the model (instead returning canned text), or to nuke output if it appears to violate certain criteria.

But then, you have two independent mechanisms that can get out of sync, a classic source of issues in infosec - except both are also more or less inscrutable and fail in unexpected ways.

the-kenny

6 hours ago

With that, attackers need two gadgets instead of one. Where’s the end?

user

6 hours ago

[deleted]