> I agree. It always surprises me when people are indignant about scrapers ignoring robots.txt and throw around words like "theft" and "abuse."
This feels like the kind of argument someone would make for why they aren't required to return their shopping cart to the bay.
> robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.
Well, no. That's an overly simplistic description that fits your argument but doesn't accurately represent reality. Yes, robots.txt was created as a hint for robots, but the expectation was always that robots would honor it. The important detail, the one that explains why it's called robots.txt, is that the web server exists to serve the requests of humans. Robots are welcome too, but please follow these rules.
You can tell your description is inaccurate, and not representative of the expectations of the web as a whole, because every popular LLM scraper goes out of its way both to follow robots.txt and to announce that it does.
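To illustrate what "following robots.txt" looks like in practice, here's a minimal sketch of how a well-behaved crawler consults the file before fetching, using Python's standard-library `urllib.robotparser`. The site rules and the crawler name `ExampleBot` are made up for the example:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, as a site operator might publish it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite robot checks the rules before each request.
print(parser.can_fetch("ExampleBot", "https://example.com/index.html"))  # True
print(parser.can_fetch("ExampleBot", "https://example.com/private/x"))   # False
```

Nothing technically stops a scraper from skipping this check; the point is that the protocol's whole design assumes the robot will voluntarily run it.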
> It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch.
It's nothing like that. It's more like a note that says "no soliciting," or "please knock quietly because the baby is sleeping."
> It's fine for low stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center.
Or, people could not be assholes? Yes, I get it: in the reality we live in, there are assholes. But the problem, as I see it, is not just the assholes, but the people who act as apologists for this clearly deviant behavior.
> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.
Because it's your fault if you don't, right? That's victim blaming. I want to be able to host free, easy-to-access content for humans, but someone with more money and more compute resources than I have gets to overwhelm my server because they don't care... and that's my fault, right?
I guess that's a take...
There's a huge difference between suggesting mitigations for dealing with someone abusing resources and excusing that abuse, or implying that I should simply expect my server to be abused rather than be frustrated about it.