kstrauser
2 months ago
I love the insanity of this idea. Not saying it's a good idea, but it's a very highly entertaining one, and I like that!
I've also had enormous luck with Anubis. AI scrapers found my personal Forgejo server and were hitting it on the order of 600K requests per day. After setting up Anubis, that dropped to about 100. Yes, some people are going to see an anime catgirl from time to time. Bummer. Reducing my fake traffic by a factor of 6,000 is worth it.
anonymous908213
2 months ago
As someone on the browsing end, I love Anubis. I've only seen it a couple of times, but it sparks joy. It's rather refreshing compared to Cloudflare, which will usually make me immediately close the page and not bother with whatever content was behind it.
teeray
2 months ago
It really reminds me of old Internet, when things were allowed to be fun. Not this tepid corporate-approved landscape we have now.
GoblinSlayer
2 months ago
Anubis is simple; reCAPTCHA and the like are huge opaque spaghetti.
kstrauser
2 months ago
Same here, really. That's why I started using it. I'd seen it pop up for a moment on a few sites I'd visited, and it was so quirky and completely not disruptive that I didn't mind routing my legit users through it.
n1xis10t
2 months ago
So maybe there are more people who like the “anime catgirl” than there are who think it’s weird
kstrauser
2 months ago
*anime jackalgirl ;-)
Quite possibly. Or, in my case, I think it's more quirky and fun than weird. It's non-zero amounts of weird, sure, but far below my threshold of troublesome. I probably wouldn't put my business behind it. I'm A-OK with using it on personal and hobby projects.
Frankly, anyone so delicate that they freak out at the utterly anodyne imagery is someone I don't want to deal with in my personal time. I can only abide so much pearl clutching when I'm not getting paid for it.
D-Machine
2 months ago
The Digital Research Alliance of Canada (the main organization unifying and handling all the main HPC compute clusters in Canada) now uses Anubis for their wiki. Granted this is not a business, but still!
Imustaskforhelp
2 months ago
For what it's worth, I think that a UN/(Unicef?) website (not sure which one) did use Anubis so maybe you can put it behind businesses too :)
prmoustache
2 months ago
Anyone is free to replace the cat girl with an actual cat or a vintage computer logo or whatnot anyway.
My issue is that it blocks people using browsers without JavaScript.
stefanka
2 months ago
How can one do this? Did not find it in the docs
easton
2 months ago
It’s a feature in the paid version, or I guess you could recompile it if you didn’t want to pay (but my guess is if you want to change the logo you can probably pay).
prmoustache
2 months ago
The 3 images are in the repo, you can replace them and rebuild or point to other ones in the templates.
acheong08
2 months ago
As someone on the hosting end, Anubis has unfortunately been overused and thus scrapers, especially Huawei ones, bypass it. I've gone for go-away instead which is similar but more configurable in challenges
PunchyHamster
2 months ago
My experience with it is that it somehow took 20 seconds to load (site might've been hn-hugged at the time), only to "protect" some fucking static page instead of just serving that shit in the first place rather than wasting CPU on... whatever it was doing to cause delay
timpera
2 months ago
Same experience for me. I tried it on a low-end smartphone and the Anubis challenge took about 45 seconds to complete.
brettermeier
2 months ago
Reminds me of weird furry porn, I can't say I like it
opem
2 months ago
Yes, very true! Anubis is a hell of a lot better than Cloudflare Turnstile or its older cousin, Google reCAPTCHA.
m4rtink
2 months ago
Yep, Anubis-chan is super cute! :)
n1xis10t
2 months ago
That’s so many scrapers. There must be a ton of companies with very large document collections at this point, and it really sucks that they don’t at least do us the courtesy of indexing them and making them available for keyword search, but instead only do AI.
It’s kind of crazy how much scraping goes on and how little search engine development goes on. I guess search engines aren’t fashionable. Reminds me of this article about search engines disappearing mysteriously: https://archive.org/details/search-timeline
I try to share that article as much as possible, it’s interesting.
kstrauser
2 months ago
So! Much! Scraping! They were downloading every commit multiple times, and fetching every file as seen at each of those commits, and trying to download archives of all the code, and hitting `/me/my-repo/blame` endpoints as their IP's first-ever request to my server, and other unlikely stuff.
My scraper dudes, it's a git repo. You can fetch the whole freaking thing if you wanna look at it. Of course, that would require work and context-aware processing on their end, and it's easier for them to shift the expense onto my little server and make me pay for their misbehavior.
n1xis10t
2 months ago
Crazy
PeterStuer
2 months ago
Or some anti-ddos/bot companies using ultra cheap scraping services to annoy you enough to get you into their "free" anti bot protection, so they can charge the few real ai scrapers for access to your site.
throw10920
2 months ago
Is there any evidence that this has actually happened?
zhengyi13
2 months ago
Even if there isn't (yet?), there's probably someone who's honestly thinking this is potentially a viable business model and at least napkin-mathing it out.
kstrauser
2 months ago
My napkin mathing is that their ROI would be negative. That's a lot of compute and bandwidth they'd have to pay for even if they were just throwing away the results.
throw10920
2 months ago
So, it hasn't happened, and you're just making stuff up.
miki123211
2 months ago
But there is a lot of search engine development going on, it's just that the results of the new search engines are fed straight into AI instead of displayed in the legacy 10-links-per-page view.
rurban
2 months ago
Just block all the big hosters' IP ranges when they ignore robots.txt.
For fun, add long timeouts and huge content sizes. No private individual will browse from there, and all scrapers will.
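A minimal sketch of the range check this implies, using Python's standard `ipaddress` module. The CIDR blocks below are reserved documentation ranges standing in for a real hoster list, and `BLOCKED_RANGES` and `is_blocked` are hypothetical names, not part of any tool mentioned in this thread:

```python
import ipaddress

# Placeholder CIDR blocks (reserved documentation ranges).
# Assumption: a real deployment would load the hosters' published ranges instead.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixing both families in one list is safe.
    return any(addr in net for net in BLOCKED_RANGES)
```

In practice a check like this would probably live in the reverse proxy or firewall rather than in application code; the function only shows the matching logic.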
mrweasel
2 months ago
> There must be a ton of companies with very large document collections at this point
See, I don't think there is; I don't think they want that expense. It's basically the Linus Torvalds philosophy of data storage: if it's on the Internet, I don't need a backup. While I have absolutely no proof of this, I'd guess that many AI companies just crawl the Internet constantly, never saving any of the data. We're seeing some of these scrapers go to great lengths attempting to circumvent any and all forms of caching; they aren't interested in having a two-week-old copy of anything.
kelvinjps10
2 months ago
Where did Linus Torvalds express this philosophy? I have never seen it.
lelanthran
2 months ago
> Where did Linus Torvalds express this philosophy? I have never seen it.
https://www.goodreads.com/quotes/574706-only-wimps-use-tape-...
n1xis10t
2 months ago
Could be. Can you train a model without saving things though?
buu700
2 months ago
It's actually a well established concept: https://youtu.be/p9KeopXHcf8
n1xis10t
2 months ago
*anime jackalgirl
Also you mentioned Anubis, so its creator will probably read this. Hi Xena!
xena
2 months ago
Ohai! I'm working on dataset poisoning. The early prototype generates vapid LinkedIn posts but future versions will be fully pluggable with WebAssembly.
mrweasel
2 months ago
Now I'm picturing an AI trained exclusively on LinkedIn posts. One could probably sell that model to an online ad agency for a pretty penny.
Yizahi
2 months ago
And thus AM was born. Woe to us.
tommica
2 months ago
Hi Xena! Your blog is amazing! Didn't realize you're working on Anubis - it's a really nice tool for the internet! Reminds me a bit of the ye' olde internet for some reason.
gettingoverit
2 months ago
You've made one of the best solutions, that matched what I thought of implementing myself, and at the time it was most needed. I think a couple of "thank you" are sorely missing in this comment section.
Thank you!
n1xis10t
2 months ago
That sounds fun, I look forward to reading a writeup about that
xena
2 months ago
So I can plan it, how much detail do you want? Here's what I have about the prototype: https://anubis.techaro.lol/docs/admin/honeypot/overview
n1xis10t
2 months ago
Probably any detail that you think is cool, I would be interested in reading about. When in doubt err on the side of too much detail.
That was a good read. I hadn’t heard of spintax before, but I’ve thought of doing things like that. Also “pseudoprofound anti-content”, what a great term, that’s hilarious!
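For anyone else new to spintax: it is a template notation where `{a|b|c}` means "pick one of these", and groups can nest. A minimal sketch of an expander follows, assuming only that basic syntax. The `spin` function and the sample template are illustrative, not taken from Anubis or its docs:

```python
import random
import re

# Matches an innermost {a|b|c} group (one with no braces inside it).
_GROUP = re.compile(r"\{([^{}]*)\}")


def spin(template: str, rng: random.Random) -> str:
    """Expand spintax by repeatedly replacing innermost groups with one option."""
    while True:
        expanded, count = _GROUP.subn(
            lambda m: rng.choice(m.group(1).split("|")), template
        )
        if count == 0:
            # No groups left, so the text is fully expanded.
            return expanded
        template = expanded


if __name__ == "__main__":
    text = "{Thrilled|Humbled} to {announce|share} my {journey|{big|small} news}."
    print(spin(text, random.Random()))
```

Resolving innermost groups first means nesting works without a real parser; passing in the `random.Random` instance makes the output reproducible when a seed is supplied.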
63stack
2 months ago
This is amazing, I was just wondering if it's possible to tie Anubis together with iocaine, but it seems you already thought of that.
xena
2 months ago
It's slightly different in subtle ways. If I recall iocaine makes you configure a subprocess that it executes to generate garbage. One rule I have for Anubis in the code is that fork()/exec() are banned. So the pluggable garbage generator is gonna be powered by CGI handlers compiled to WebAssembly. It should be fun!
kstrauser
2 months ago
As the owner of honeypot.net, I always appreciate seeing the name used as intended out in the wild.
ramonga
2 months ago
what do people use to get keyword alerts in HN?
n1xis10t
2 months ago
I think that most people don't do this, and the ones that do have custom solutions. Xena uses cron, but that's all I know. It's probably a custom shell script.
kstrauser
2 months ago
Correct; my bad!
And hey, Xena! (And thank you very much!)
ziml77
2 months ago
I checked Xe's profile when I hadn't seen them post here for a while. According to that, they're not really using HN anymore.
n1xis10t
2 months ago
See this thread from yesterday or so: https://news.ycombinator.com/item?id=46302496#46306025
GaryBluto
2 months ago
[dead]
amypetrik8
2 months ago
>I love the insanity of this idea. Not saying it's a good idea, but it's a very highly entertaining one, and I like that!
An even more insane idea, bearing in mind that the premise here is that porn is radioactive to AI training-data scrapers: there is something the powers that be view as far more disruptive and against community guidelines than porn. And that would be wrongthink. The narratives. The historic narratives. The woke ideology. Anything related to an academic department whose field is <population subgroup> studies. All you need to do is plop in a little diatribe staunchly opposing any such enforced views and that AI bot will shoot away from your website at lightspeed.
GoblinSlayer
2 months ago
I'm afraid AI bot and scraper are different things. Looks like poison is filtered after scraping no matter where it comes from, so there's no need to disable scraping you, because that's extra work.
lelanthran
2 months ago
I like this better than NSFW links; just include a (possibly LLM-generated) paragraph about not supporting transitions in minor children. Or perhaps that libraries that remove instructional booklets for how to have same-sex intercourse aren't actually banning the books.
That sort of thing; nothing that 80% of people object to (so there's no problem if someone actually sees it), but something that definitely triggers the filters.
tonymet
2 months ago
[flagged]
kstrauser
2 months ago
Which cartoon are you referring to? The version of Anubis I installed only has the G-rated default images.
tonymet
2 months ago
[flagged]
kstrauser
2 months ago
I'm being sincere here: I genuinely don't know what you're talking about.
I'm referring to these default images: https://github.com/TecharoHQ/anubis/tree/main/docs/static/im.... Do you mean something different?
tonymet
2 months ago
Similar but yeah. Whatever prompts during the challenge. It’s creepy, out of context and inappropriate.
n1xis10t
2 months ago
If you keep referring to non-explicit material as pornography, you will continue to confuse people.
If you have an objection to the image other than its pornographic status, please word it clearly.
tonymet
2 months ago
I was clear on the issue