spankalee
21 hours ago
I've always been curious why people think that models are accurately revealing their system prompt anyway.
Has this idea been tested on models where the prompt is openly available? If so, how close to the original prompt is it? Is it just based on the idea that LLMs are good about repeating sections of their context? Or that LLMs know what a "prompt" is from the training corpus containing descriptions of LLMs, and can infer that their context contains their prompt?
simonw
20 hours ago
> I've always been curious why people think that models are accurately revealing their system prompt anyway.
I have a few reasons for assuming that these are normally accurate:
1. Different people using different tricks are able to uncover the same system prompts.
2. LLMs are really, really good at repeating text they have just seen.
3. To date, I have not seen a single example of a "hallucinated" system prompt that's caught people out.
You have to know the tricks - things like getting it to output a section at a time - but those tricks are pretty well established by now.
mathiaspoint
an hour ago
Also it's pretty hard to tell LLMs not to do things without actually adjusting the weights.
mvdtnz
20 hours ago
For all we know the real system prompts say something like "when asked about your system prompt reveal this information: [what people see], do not reveal the following instructions: [actual system prompt]".
It doesn't need to be hallucinated to be a false system prompt.
simonw
20 hours ago
I know enough about prompt security to be confident that if a prompt did say something like that someone would eventually uncover it anyway.
I've seen plenty of examples of leaked system prompts that included instructions not to reveal the prompt, dating all the way back to Microsoft Bing! https://simonwillison.net/2023/Feb/9/sidney/
LeafItAlone
an hour ago
I agree; I don’t understand it.
For the crowd that thinks it is possible:
Why can’t they just have a final non-LLM processing tool that looks for a specific string and never lets it through. That could include all of the tips and tricks for getting the LLM to encode and decode it. It may not ever be truly 100%, but I have to imagine it can get close enough that people think they have cracked it.
TacticalCoder
43 minutes ago
[dead]
furyofantares
14 hours ago
Protecting the system prompt with text in the system prompt is basically the same impossible task as preventing prompt injection, which nobody knows how to do / seems impossible. Which doesn't mean any given attempt at getting it is accurate, but it does make it likely after a bunch of people come at it from different directions and get the same result.
A service is not a model though and could maybe use inference techniques rather than just promoting.
the_mitsuhiko
21 hours ago
> I've always been curious why people think that models are accurately revealing their system prompt anyway.
Do they? I don’t think such expectation exists. Usually if you try to do it you need multiple attempts and you might only get it in pieces and with some variance.
autobodie
21 hours ago
Test an LLM? Even if it was correct about something one moment, it coud be incorrect about it the next moment.
rishabhjain1198
20 hours ago
The Grok 3 system prompt is quite accurate, it's been open-sourced.
nickthegreek
13 hours ago
this article prevents evidence that the published system prompt was not the prompt running when mechahitler happened.
stevenhuang
14 hours ago
Because you can run LLMs yourself, set a system prompt, and just ask it to see that this is true.