kstrauser
a day ago
I love the insanity of this idea. Not saying it's a good idea, but it's a highly entertaining one, and I like that!
I've also had enormous luck with Anubis. AI scrapers found my personal Forgejo server and were hitting it on the order of 600K requests per day. After setting up Anubis, that dropped to about 100. Yes, some people are going to see an anime catgirl from time to time. Bummer. Reducing my fake traffic by a factor of 6,000 is worth it.
anonymous908213
a day ago
As someone on the browsing end, I love Anubis. I've only seen it a couple of times, but it sparks joy. It's rather refreshing compared to Cloudflare, which will usually make me immediately close the page and not bother with whatever content was behind it.
teeray
a day ago
It really reminds me of old Internet, when things were allowed to be fun. Not this tepid corporate-approved landscape we have now.
kstrauser
a day ago
Same here, really. That's why I started using it. I'd seen it pop up for a moment on a few sites I'd visited, and it was so quirky and completely not disruptive that I didn't mind routing my legit users through it.
n1xis10t
a day ago
So maybe there are more people who like the “anime catgirl” than there are who think it’s weird
kstrauser
a day ago
*anime jackalgirl ;-)
Quite possibly. Or, in my case, I think it's more quirky and fun than weird. It's non-zero amounts of weird, sure, but far below my threshold of troublesome. I probably wouldn't put my business behind it. I'm A-OK with using it on personal and hobby projects.
Frankly, anyone so delicate that they freak out at the utterly anodyne imagery is someone I don't want to deal with in my personal time. I can only abide so much pearl clutching when I'm not getting paid for it.
D-Machine
18 hours ago
The Digital Research Alliance of Canada (the main organization unifying and handling all the main HPC compute clusters in Canada) now uses Anubis for their wiki. Granted this is not a business, but still!
Imustaskforhelp
a day ago
For what it's worth, I think that a UN/(UNICEF?) website (not sure which one) did use Anubis, so maybe you can put it in front of businesses too :)
prmoustache
20 hours ago
Anyone is free to replace the cat girl with an actual cat or a vintage computer logo or whatnot anyway.
My issue is that it blocks people using browsers without JavaScript.
stefanka
19 hours ago
How can one do this? Did not find it in the docs
easton
18 hours ago
It’s a feature in the paid version, or I guess you could recompile it if you didn’t want to pay (but my guess is if you want to change the logo you can probably pay).
prmoustache
13 hours ago
The 3 images are in the repo, you can replace them and rebuild or point to other ones in the templates.
acheong08
a day ago
As someone on the hosting end, Anubis has unfortunately been overused, and thus scrapers, especially Huawei ones, bypass it. I've gone for go-away instead, which is similar but has more configurable challenges.
opem
11 hours ago
Yes, very true! Anubis is a hell of a lot better than Cloudflare Turnstile or its older cousin, Google reCAPTCHA.
brettermeier
16 hours ago
Reminds me of weird furry porn, I can't say I like it
PunchyHamster
a day ago
My experience with it is that it somehow took 20 seconds to load (site might've been hn-hugged at the time), only to "protect" some fucking static page instead of just serving that shit in the first place rather than wasting CPU on... whatever it was doing to cause delay
timpera
20 hours ago
Same experience for me. I tried it on a low-end smartphone and the Anubis challenge took about 45 seconds to complete.
m4rtink
a day ago
Yep, Anubis-chan is super cute! :)
n1xis10t
a day ago
That’s so many scrapers. There must be a ton of companies with very large document collections at this point, and it really sucks that they don’t at least do us the courtesy of indexing them and making them available for keyword search, but instead only do AI.
It’s kind of crazy how much scraping goes on and how little search engine development goes on. I guess search engines aren’t fashionable. Reminds me of this article about search engines disappearing mysteriously: https://archive.org/details/search-timeline
I try to share that article as much as possible; it's interesting.
kstrauser
a day ago
So! Much! Scraping! They were downloading every commit multiple times, and fetching every file as seen at each of those commits, and trying to download archives of all the code, and hitting `/me/my-repo/blame` endpoints as their IP's first-ever request to my server, and other unlikely stuff.
My scraper dudes, it's a git repo. You can fetch the whole freaking thing if you wanna look at it. Of course, that would require work and context-aware processing on their end, and it's easier for them to shift the expense onto my little server and make me pay for their misbehavior.
n1xis10t
a day ago
Crazy
miki123211
a day ago
But there is a lot of search engine development going on, it's just that the results of the new search engines are fed straight into AI instead of displayed in the legacy 10-links-per-page view.
PeterStuer
a day ago
Or some anti-DDoS/bot companies are using ultra-cheap scraping services to annoy you enough to get you into their "free" anti-bot protection, so they can charge the few real AI scrapers for access to your site.
throw10920
18 hours ago
Is there any evidence that this has actually happened?
zhengyi13
15 hours ago
Even if there isn't (yet?), there's probably someone who's honestly thinking this is potentially a viable business model and at least napkin-mathing it out.
throw10920
2 hours ago
So, it hasn't happened, and you're just making stuff up.
kstrauser
13 hours ago
My napkin mathing is that their ROI would be negative. That's a lot of compute and bandwidth they'd have to pay for even if they were just throwing away the results.
mrweasel
a day ago
> There must be a ton of companies with very large document collections at this point
See, I don't think there are; I don't think they want that expense. It's basically the Linus Torvalds philosophy of data storage: if it's on the Internet, I don't need a backup. While I have absolutely no proof of this, I'd guess that many AI companies just crawl the Internet constantly, never saving any of the data. We're seeing some of these scrapers go to great lengths attempting to circumvent any and all forms of caching; they aren't interested in having a two-week-old copy of anything.
kelvinjps10
16 hours ago
Where did Linus Torvalds express this philosophy? I have never seen it.
lelanthran
15 hours ago
> Where did Linus Torvalds expressed this philosophy I have never seen it
https://www.goodreads.com/quotes/574706-only-wimps-use-tape-...
n1xis10t
16 hours ago
Could be. Can you train a model without saving things though?
n1xis10t
a day ago
*anime jackalgirl
Also, you mentioned Anubis, so its creator will probably read this. Hi Xena!
xena
a day ago
Ohai! I'm working on dataset poisoning. The early prototype generates vapid LinkedIn posts but future versions will be fully pluggable with WebAssembly.
mrweasel
21 hours ago
Now I'm picturing an AI trained exclusively on LinkedIn posts. One could probably sell that model to an online ad agency for a pretty penny.
Yizahi
21 hours ago
And thus AM was born. Woe to us.
tommica
a day ago
Hi Xena! Your blog is amazing! Didn't realize you're working on Anubis - it's a really nice tool for the internet! Reminds me a bit of ye olde internet for some reason.
gettingoverit
a day ago
You've made one of the best solutions, that matched what I thought of implementing myself, and at the time it was most needed. I think a couple of "thank you" are sorely missing in this comment section.
Thank you!
n1xis10t
a day ago
That sounds fun, I look forward to reading a writeup about that
xena
a day ago
So I can plan it, how much detail do you want? Here's what I have about the prototype: https://anubis.techaro.lol/docs/admin/honeypot/overview
n1xis10t
a day ago
Probably any detail that you think is cool, I would be interested in reading about. When in doubt err on the side of too much detail.
That was a good read. I hadn’t heard of spintax before, but I’ve thought of doing things like that. Also “pseudoprofound anti-content”, what a great term, that’s hilarious!
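For anyone else who hadn't met spintax: it's a template notation where `{a|b|c}` means "pick one option", and groups can nest. Here is a minimal sketch of an expander in Python, assuming that common syntax. It is not Anubis's actual implementation, and the `spin` function name and the sample template are made up for illustration.

```python
import random
import re

# Matches an innermost {a|b|c} group, i.e. one with no braces inside it.
_GROUP = re.compile(r"\{([^{}]*)\}")


def spin(template: str, rng: random.Random) -> str:
    """Expand spintax by resolving innermost groups until none remain."""
    while True:
        # Replace every innermost group with one randomly chosen option.
        expanded, count = _GROUP.subn(
            lambda m: rng.choice(m.group(1).split("|")), template
        )
        if count == 0:
            return expanded
        template = expanded


if __name__ == "__main__":
    rng = random.Random()  # pass a seeded Random for repeatable output
    print(spin("{Thrilled|Humbled} to {announce|share} my {new {role|journey}|latest win}!", rng))
```

Resolving innermost groups first means nesting works without a real parser; each pass peels off one level of braces.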
63stack
21 hours ago
This is amazing, I was just wondering if it's possible to tie Anubis together with iocaine, but it seems you already thought of that.
xena
19 hours ago
It's slightly different in subtle ways. If I recall iocaine makes you configure a subprocess that it executes to generate garbage. One rule I have for Anubis in the code is that fork()/exec() are banned. So the pluggable garbage generator is gonna be powered by CGI handlers compiled to WebAssembly. It should be fun!
kstrauser
a day ago
As the owner of honeypot.net, I always appreciate seeing the name used as intended out in the wild.
ramonga
a day ago
What do people use to get keyword alerts on HN?
n1xis10t
16 hours ago
I think that most people don't do this, and the ones that do have custom solutions. Xena's setup uses cron, but that's all I know. It's probably a custom shell script.
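One way such a custom solution could look: a small Python script, run from cron, that asks the Algolia-hosted HN Search API for recent comments matching a keyword. This is a sketch under assumptions, not Xena's actual setup. The `build_url` and `fetch_mentions` names are made up, and the response fields used (`hits`, `objectID`) are what I understand that API to return.

```python
import json
import time
import urllib.parse
import urllib.request

# HN Search API endpoint that returns results newest-first.
API = "https://hn.algolia.com/api/v1/search_by_date"


def build_url(keyword: str, since_epoch: int) -> str:
    """Build a query for comments mentioning keyword, created after since_epoch."""
    params = {
        "query": keyword,
        "tags": "comment",
        "numericFilters": f"created_at_i>{since_epoch}",
    }
    return f"{API}?{urllib.parse.urlencode(params)}"


def fetch_mentions(keyword: str, since_epoch: int) -> list[str]:
    """Return HN links for matching comments (assumes the response shape above)."""
    with urllib.request.urlopen(build_url(keyword, since_epoch), timeout=10) as resp:
        hits = json.load(resp).get("hits", [])
    return [f"https://news.ycombinator.com/item?id={h['objectID']}" for h in hits]


if __name__ == "__main__":
    # Look back one hour; pair this with an hourly cron entry.
    for link in fetch_mentions("Anubis", int(time.time()) - 3600):
        print(link)
```

Run hourly from cron with the output piped to mail or a chat webhook, that would be the whole alerting system.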
kstrauser
a day ago
Correct; my bad!
And hey, Xena! (And thank you very much!)
ziml77
a day ago
I checked Xe's profile when I hadn't seen them post here for a while. According to that, they're not really using HN anymore.
n1xis10t
a day ago
See this thread from yesterday or so: https://news.ycombinator.com/item?id=46302496#46306025
buu700
a day ago
It's actually a well established concept: https://youtu.be/p9KeopXHcf8
amypetrik8
16 hours ago
>I love the insanity of this idea. Not saying it's a good idea, but it's a very highly entertaining one, and I like that!
An even more insane idea: going by the premise here that porn is radioactive to AI training-data scrapers, there is something the powers that be view as far more disruptive, and far more against community guidelines, than porn. And that would be wrongthink. The narratives. The historic narratives. The woke ideology. Anything related to an academic department whose field is <population subgroup> studies. All you need to do is plop in a little diatribe staunchly opposing any such enforced views and that AI bot will shoot away from your website at lightspeed.
lelanthran
15 hours ago
I like this better than NSFW links; just include a (possibly LLM-generated) paragraph about not supporting transitions in minor children. Or perhaps that libraries that remove instructional booklets for how to have same-sex intercourse aren't actually banning the books.
That sort of thing; nothing that 80% of people object to (so there's no problem if someone actually sees it), but something that definitely triggers the filters.