the_harpia_io
13 hours ago
This is cool - the ~70% success rate on basic attacks tracks with what I've seen. Most agent frameworks just pipe raw text through without any sanitization because "it's just summarizing a page, what could go wrong."
The screenshot approach nate mentions is interesting but feels like trading one problem for another. You're immune to text injection but now vulnerable to visual tricks - misleading rendered text, fake UI elements, those unicode lookalike characters that render identically but have different meanings.
Curious if you've tested any agents that do pre-processing on the HTML - like stripping invisible elements, normalizing unicode, etc - before passing to the model. That's the approach I've seen in a few internal tools but haven't benchmarked how effective it actually is against multi-layer attacks like yours.
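For reference, the pre-processing I've seen in those internal tools looks roughly like this - untested sketch, BeautifulSoup-based, and the selectors are illustrative rather than a complete catalog of hiding techniques:

```python
import re
import unicodedata
from bs4 import BeautifulSoup, Comment

ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")

def sanitize_page(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")

    # detach elements hidden from sighted users but still visible to a text pipeline
    for el in soup.select('[aria-hidden="true"], [hidden]'):
        el.extract()
    for el in soup.find_all(style=re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)):
        el.extract()

    # strip HTML comments, a classic injection carrier
    for c in soup.find_all(string=lambda s: isinstance(s, Comment)):
        c.extract()

    text = soup.get_text(separator=" ")
    # NFKC folds fullwidth/compatibility lookalikes; note it won't catch
    # cross-script homoglyphs (cyrillic "а" stays distinct from latin "a")
    text = unicodedata.normalize("NFKC", text)
    text = ZERO_WIDTH.sub("", text)
    # collapse whitespace so spacing-based hiding doesn't survive
    return re.sub(r"\s+", " ", text).strip()
```

No idea how well that tier holds up against layered attacks though, hence the question.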
nate
12 hours ago
sorry, i didn't mean to imply that screenshotting is the only thing this agent does - just that it's one thing my agent does that has this neat property. i also have a host of other things going on when it needs to grab and understand the contents of a page. the screenshot is used in conjunction with the html to navigate and find things, but the agent also handles the techniques this particular test tries (hidden divs, aria-hidden, etc.) and tries to tell the model what's trusted and untrusted.
but the big thing i have in here is simply a cross-domain check: if the agent is about to navigate away from the current domain, we alert the user to the domain change. this all runs in a browser context too, so the browser's csrf protections are also being relied on. it's the cross-domain navigation i'm really worried about and trying to make sure i've super hardened, though that's admittedly the trickiest part in a browser. i feel like browsers are going to need a new "non-origin" kind of flow that knows an agent is browsing and does something like blocking and confirming natively.
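the gate itself is conceptually tiny - simplified sketch below, not my exact code, and tldextract for the registrable domain is just one way to do that part:

```python
# simplified sketch of the cross-domain gate - illustrative, not the actual implementation
import tldextract  # one option for getting the registrable domain

def registrable_domain(url: str) -> str:
    ext = tldextract.extract(url)  # splits subdomain / domain / public suffix
    return f"{ext.domain}.{ext.suffix}"

def guard_navigation(current_url: str, target_url: str, confirm) -> bool:
    """allow same-site navigation; block cross-domain unless the user confirms."""
    if registrable_domain(target_url) == registrable_domain(current_url):
        return True  # same registrable domain, let it through silently
    # leaving the current domain is the dangerous case - stop and ask the human
    return confirm(
        f"agent wants to go from {registrable_domain(current_url)} "
        f"to {registrable_domain(target_url)} - allow?"
    )
```

the hard part isn't this check, it's making sure every navigation path actually routes through it - redirects, js location changes, link targets, all of it.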
the_harpia_io
11 hours ago
The cross-domain check makes sense as the priority - that's where the real risk is. Injection making the agent do something dumb on the same site is bad, but redirecting to an attacker-controlled domain is way worse. Exfil via URL params, tokens in redirects, all that.
Your browser-native agent mode idea is interesting. Something like CSP but for navigation intent - "this agent can only interact with *.myapp.com" - and it's declarative so the injection can't social-engineer its way around it. Though browser vendors are probably 2-3 years behind on this stuff. Agent frameworks will have to solve it themselves first and then maybe Chrome picks it up later once there's consensus.
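To make the declarative part concrete - hypothetical sketch, none of this is a real browser or framework API:

```python
# Hypothetical "CSP for navigation intent" - purely illustrative, not a real API
from fnmatch import fnmatch
from urllib.parse import urlparse

# Declared at agent startup, outside the model's context window,
# so injected page text has no channel to modify it
NAVIGATION_ALLOWLIST = ["myapp.com", "*.myapp.com"]

def navigation_allowed(url: str, allowlist=NAVIGATION_ALLOWLIST) -> bool:
    host = urlparse(url).hostname or ""
    return any(fnmatch(host, pattern) for pattern in allowlist)

assert navigation_allowed("https://app.myapp.com/settings")
assert not navigation_allowed("https://evil.example/collect?q=data")
```

The enforcement point living in the browser rather than the prompt is the whole trick - same reason CSP works where "please don't run inline scripts" wouldn't.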
joozio
13 hours ago
Haven't benchmarked pre-processing approaches yet, but that's a natural next step. Right now the test page targets raw agent behavior — no middleware. A comparison between raw vs sanitized pipelines against the same attacks would be really useful. The multi-layer attack (#10) would probably be the hardest to strip cleanly since it combines structural hiding with social engineering in the visible text.
the_harpia_io
13 hours ago
Yeah, the social engineering + structural combination is brutal to defend against. You can strip the technical hiding but the visible prompt injection still works on the model. Would be interesting to see how much of the ~70% success rate drops with just basic sanitization (strip comments, normalize whitespace, remove zero-width) vs more aggressive stripping.
If you build out a v2 with middleware testing, a leaderboard by framework would be killer. "How manipulation-proof is [Langchain/AutoGPT/etc] out of the box vs with basic defenses" would get a lot of attention.