embedding-shape
28 minutes ago
> In our first experiment, we explained to the model the possibility that “thoughts” may be artificially injected into its activations, and observed its responses on control trials (where no concept was injected) and injection trials (where a concept was injected). We found that models can sometimes accurately identify injection trials, and go on to correctly name the injected concept.
Overview image: https://transformer-circuits.pub/2025/introspection/injected...
https://transformer-circuits.pub/2025/introspection/index.ht...
That's very interesting, and for me kind of unexpected.
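For anyone wondering what "injecting a concept into activations" could look like mechanically, here is a toy sketch. It is not the paper's code: the array shapes, the `inject_concept` and `concept_score` helpers, and the `strength` value are all made up for illustration. Note also that the projection at the end is an external measurement, whereas the quoted experiment asks the model itself to report the injection.

```python
import numpy as np

def inject_concept(activations, concept_vector, strength=4.0):
    """Add a scaled unit concept direction to every position (toy stand-in for injection)."""
    unit = concept_vector / np.linalg.norm(concept_vector)
    return activations + strength * unit

def concept_score(activations, concept_vector):
    """Mean projection of the activations onto the unit concept direction."""
    unit = concept_vector / np.linalg.norm(concept_vector)
    return float((activations @ unit).mean())

rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 64))   # hypothetical: 8 token positions, 64-dim activations
concept = rng.normal(size=64)       # hypothetical concept direction

control = hidden                                          # control trial: nothing injected
injected = inject_concept(hidden, concept, strength=4.0)  # injection trial

# The injection shifts the mean projection by exactly `strength`.
shift = concept_score(injected, concept) - concept_score(control, concept)
print(f"projection shift: {shift:.3f}")
```

In this toy setup the difference between the two trials is trivially readable from outside. The surprising part of the quoted result is that the model can sometimes notice and name the injected concept itself.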