bobbiechen
9 hours ago
>We’ve already done the work to render the page, and we’re trying to shed load, so why would I want to increase load by generating challenges and verifying responses? It annoys me when I click a seemingly popular blog post and immediately get challenged, when I’m 99.9% certain that somebody else clicked it two seconds before me. Why isn’t it in cache? We must have different objectives in what we’re trying to accomplish. Or who we’re trying to irritate.
+1000. I feel like so much bot detection (and fraud prevention against human actors, too) is so emotionally driven. Some people hate these things so much, they're willing to cut off their nose to spite their face.
bayindirh
7 hours ago
My view on this is simple:
If you're a bot that will ignore all the licenses I put on that content, then I don't want you to be able to reach that content.
No, any amount of monetary compensation is not welcome either. I use these licenses as a matter of principle, and my principles are not for sale.
That's all, thanks.
beeflet
6 hours ago
I think the problem is that despite the effort, you will still end up in the dataset. So it's futile.
warkdarrior
7 hours ago
How can you tell a bot will ignore all your content licenses?
bayindirh
7 hours ago
Currently, all AI companies argue that the content they use falls under fair use, and they disregard all licenses. This means any future ones that do respect these licenses need to be whitelisted.
diggan
6 hours ago
How do you know that a given bot is part of one of those AI companies? Maybe it's my personal bot you're blocking; should I also not have (indirect) access to the content?
simianparrot
6 hours ago
No. Access to my content is a privilege I grant you. I decide how you get to access it, and a bot that my setup mistakes for an AI crawler belonging to an anti-human AI corporation is not a valid way to access it. Get off my virtual lawn.
diggan
6 hours ago
> No. Access to my content is a privilege I grant you.
Right, I thought the conversation was about public websites on the public internet, but it sounds like you're talking about a private website now? I understand keeping tighter controls if you're dealing with private content that you want accessible over the internet to some people but not to the public at large.
privatelypublic
5 hours ago
All websites are private (excepting maybe government sites). In most places the internet infrastructure itself is private.
You're conflating this with a legal concept that applies to spaces that are shared, government-owned, paid for by taxes, and that the government feels people should be able to access.
The web is closer to a shopping mall. You're on one person's property to access the stuff of other people who pay to be there. They set their own rules. If you don't follow those rules you get kicked out, charged with trespassing, and possibly banned from the mall entirely.
AI bots have been asked to leave. But since the AI companies own the mall too, the store owners are more than a little screwed.
simianparrot
5 hours ago
You’re literally visiting a service paid for by me. It’s open to the public, but it’s my domain and my server and I get to say “no thank you” to your visit if you don’t behave. You have no innate right to access the content I share.
Blocking misbehaving IP addresses isn’t new, and is another version of the same principle.
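For illustration, the mechanics behind most of that blocking are tiny. Here's a rough Python sketch of the per-IP counting these setups usually boil down to (the window and threshold are made-up numbers, not a recommendation):

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60   # look at the last minute of traffic (assumed value)
    MAX_REQUESTS = 120    # sustained ~2 req/s trips the limit (assumed value)

    hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def should_block(ip, now=None):
        """Record one request from `ip`; report whether it is over the limit."""
        now = time.time() if now is None else now
        q = hits[ip]
        q.append(now)
        while q and now - q[0] > WINDOW_SECONDS:  # drop expired timestamps
            q.popleft()
        return len(q) > MAX_REQUESTS

    # A client hammering the server trips the limit; a polite one never does.
    for i in range(130):
        blocked = should_block("203.0.113.7", now=1000.0 + i * 0.1)
    print(blocked)  # True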
bayindirh
6 hours ago
This interpretation won't take you that far.
Crawling prevention is not new. Many news outlets and biggish websites have been preventing access by non-human agents in various ways for a very long time.
Now that non-human agents have improved and started to leech everything they can find, the methods are evolving, too.
News outlets are also public sites on the public internet.
Source-available code repositories are also on the public internet, but said agents crawl and use that code, too, backed by fair-use claims.
bayindirh
6 hours ago
You can use an honest user-agent string denoting that it's your bot. Some AI companies label their bots transparently; they show up in the logs I keep.
While I understand that you may need a personal bot to crawl or mirror a site, I can't guarantee that I'll grant you access.
I don't like being that heavy-handed in the first place, but capitalism is making it harder to trust entities you can't see and talk to face to face.
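As a rough illustration (not my actual setup), a Python sketch like this picks the self-labelled bots out of a combined-format access log. The bot names are just examples of crawlers that do identify themselves, and the log path is assumed:

    import re
    from collections import Counter

    # Substrings used by some crawlers that identify themselves honestly.
    LABELLED_BOTS = ["GPTBot", "CCBot", "ClaudeBot", "Bytespider"]

    # In the combined log format, the user agent is the last quoted field.
    UA_RE = re.compile(r'"([^"]*)"$')

    counts = Counter()
    with open("access.log") as log:  # path is an assumption
        for line in log:
            match = UA_RE.search(line.strip())
            if match is None:
                continue
            for bot in LABELLED_BOTS:
                if bot in match.group(1):
                    counts[bot] += 1

    for bot, n in counts.most_common():
        print(f"{n:8d}  {bot}")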
Vegenoid
6 hours ago
I think it’s better viewed through a lens of effort. Implementing systems that try harder to not challenge humans takes more work than just throwing up a catch-all challenge wall.
The author’s goal is admirable: “My primary principle is that I’d rather not annoy real humans more than strictly intended”. However, the primary goal for many people hosting content will be “block bots and allow humans with minimal effort and tuning”.
jitl
8 hours ago
Really? If I’m an unsophisticated blog not using a CDN, and I get a $1000 bill for bandwidth overage or something, I’m gonna google a solve and slap it on there, because I don’t want to pay another $1000 for Big Basilisk. I don’t think that’s an emotional response; it’s common sense.
marginalia_nu
7 hours ago
Seems like you've made profoundly questionable hosting or design choices for that to happen. Flat rate web hosting exists, and blogs (especially unsophisticated ones) do not require much bandwidth or processing power.
Misbehaving crawlers are a huge problem but bloggers are among the least affected by them. Something like a wiki or a forum is a better example, as they're in a category of websites where each page visit is almost unavoidably rendered on the fly using multiple expensive SQL queries due to the rapidly mutating nature of their datasets.
Git forges, like the one TFA is discussing, are also fairly expensive, especially as crawlers traverse historical states. When a crawler is poorly implemented, it'll get stuck doing this basically forever. Detecting and dealing with git hosts is an absolute must for any web crawler because of this.
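To illustrate, a crawler-side filter can be as simple as this Python sketch. The path patterns are hypothetical stand-ins for common forge layouts, not any specific forge's actual routes:

    import re

    # Endpoints that enumerate historical state and can trap a crawler.
    # Illustrative only; real forges lay out their URLs differently.
    FORGE_HISTORY = re.compile(
        r"/(commit|commits|blame|diff|compare|log)(/|$)"
        r"|[?&]rev="  # explicit revision parameters
    )

    def should_crawl(url_path):
        """Skip URLs that walk a repository's history rather than its tip."""
        return FORGE_HISTORY.search(url_path) is None

    assert should_crawl("/alice/project")                   # landing page: fine
    assert not should_crawl("/alice/project/commit/abc123") # history: skip
    assert not should_crawl("/alice/project/blame/main/x")  # history: skip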
mtlynch
7 hours ago
>Flat rate web hosting exists, and blogs (especially unsophisticated ones) do not require much bandwidth or processing power.
I actually find this surprisingly difficult to come by.
I just want static hosting (like Netlify or Firebase Hosting), but there aren't many hosts that offer that.
There are lots of providers where I can buy a VPS somewhere and be in charge of configuring and patching it, but if I just want to hand someone a set of HTML files and some money in exchange for hosting, not many hosts fit the bill.
diggan
6 hours ago
> There are lots of providers where I can buy a VPS somewhere and be in charge of configuring and patching it, but if I just want to hand someone a set of HTML files and some money in exchange for hosting, not many hosts fit the bill.
Yeah, that's true; there aren't a lot of "I give you money and HTML, you host it" services out there, surprisingly. Probably the most mature, cheapest, and most reliable one today would be good ol' neocities.org (run by HN user kyledrake), which basically gives you 3TB/month for $5. Pretty good deal :)
Sometimes when I miss StumbleUpon I go to https://neocities.org/browse?sort_by=random which gives a fun little glimpse of the hobby/curiosity/creative web.
ghssds
5 hours ago
You already had a couple of suggestions, but I've been happy in the past with OVH.
thaumaturgy
6 hours ago
Interesting; I was under the impression this was more common than it apparently is. I know the hosting market has gotten pretty bad.
So, I'm currently building pretty much this. After doing it on the side for clients for years, it's now my full-time effort. I have a solid and stable infrastructure, but not yet an API or web frontend. If somebody wants basically ssh, git, and static (or even not static!) hosting that comes with a sysadmin's contact information for a small number of dollars per month, I can be reached at sysop@biphrost.net.
Environment is currently Debian-in-LXC-on-Debian-on-DigitalOcean.
ctoth
6 hours ago
> There are lots of providers where I can buy a VPS somewhere and be in charge of configuring and patching it, but if I just want to hand someone a set of HTML files and some money in exchange for hosting, not many hosts fit the bill.
Dreamhost! They're still around and still lovely after how many years? I even find their custom control panel charming.
hobs
6 hours ago
I really like DH (though I am still mad about the cloudatcost shenanigans) and use them, but if you use 200x the resources that the other shared sites consume, you're getting the boot just like anyone else.
marginalia_nu
6 hours ago
If you just want to host HTML for personal use, GitHub Pages is free (and works with a custom domain). There are bandwidth limitations, but they definitely won't pull an AWS on you and send a bill that would cover a new car because a crawler acted up.
phantompeace
7 hours ago
Wouldn't it be easier to put the unsophisticated blog behind Cloudflare?
mhuffman
7 hours ago
As much as I like to shit on Cloudflare at every opportunity, it would obviously be easier to put the blog behind CF than to install bot-detection plugins.