miladyincontrol
7 days ago
A lot of less scrupulous crawlers just seem to imitate the big ones. I feel like a lot of people make assumptions based on the user agent, because it has to be true, right?
My fave method is still just to have a bait path in robots.txt that gzip bombs and autoblocks all further requests from whatever touches it. Was real easy to configure in Caddy and tends to catch the worst offenders.
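Roughly the shape of it, as a sketch rather than my exact config (the bait path, bomb file and domain are made up, and actual autoblocking needs something extra on top, e.g. fail2ban tailing the access log):

    # robots.txt advertises a path no human or honest bot should ever touch:
    #   User-agent: *
    #   Disallow: /juicy-secrets/
    example.com {
        root * /srv/site

        # anything poking the bait path gets a pre-built gzip bomb instead
        @bait path /juicy-secrets/*
        handle @bait {
            root * /srv/traps
            rewrite * /bomb.gz
            header Content-Encoding gzip
            file_server
        }

        handle {
            file_server
        }
    }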
Not excusing the bot behaviours but if a few bots blindly take down your site, then an intentionally malicious offender would have a field day.
horsawlarway
6 days ago
Your last comment feels pretty apt.
Maybe I'm just a different generation than the folks writing these blog posts, but I really don't understand the fixation on such low resource usage.
It's like watching a grandparent freak out over not turning off an LED light or seeing them drive 15 miles to save 5c/gallon on gas.
20 requests per second is just... Nothing.
Even if you're dynamically generating them all (and seriously... Why? Time would have been so much better spent fixing that with some caching than this effort) it's just not much demand.
I get the "fuck the bots" style posts are popular in the Zeitgeist at the moment, but this is hardly novel.
There are a lot more productive ways to handle this that waste a lot less of your time.
whoisyc
6 days ago
1. I fear you may be underestimating the volume of bot traffic websites are now receiving. I recommend reading this article to get an idea of the scale of the problem: https://thelibre.news/foss-infrastructure-is-under-attack-by...
2. Not all requests are created equal. 20 requests a second for the same static HTML file? No problem. But if you have, say, a release page for an open source project with binary download links for all past versions for multiple platforms, each one being a multi megabyte blob, and a scraper starts hitting these links, you will run into bandwidth problems very quickly, unless you live in a utopia where bandwidth is free.
3. You are underestimating the difficulty of caching dynamic pages. Cache invalidation is hard, they say. One notably problematic example is Git blame. So far I am not aware of any existing solution for caching blame output, and jury-rigging your own will likely not be any easier than the “solution” explored in TFA.
hartator
6 days ago
> 2. Not all requests are created equal. 20 requests a second for the same static HTML file? No problem. But if you have, say, a release page for an open source project with binary download links for all past versions for multiple platforms, each one being a multi megabyte blob, and a scraper starts hitting these links, you will run into bandwidth problems very quickly, unless you live in a utopia where bandwidth is free.
All of this is (and should be) cached on a CDN. You can do 1,000 QPS on this in that config.
tdeck
6 days ago
I think a person should be able to set up a small website on a niche topic without ever having to set up a CDN. Until very recently this was the case, so it's sad to see that simplicity go away purely to satisfy the data needs of shady scrapers.
busymom0
6 days ago
Shouldn't such big blobs be put on something like Cloudflare R2 or Backblaze or even S3 with caching in front, instead of having your server handle such file downloads?
phantomathkg
6 days ago
Caching also costs money. Nothing is free.
charcircuit
6 days ago
>you live in a utopia where bandwidth is free.
It's called peering agreements and they are very common. There's a reason social media and sites like YouTube, Twitch, TikTok don't immediately go out of business. The bandwidth is free for most users.
chii
6 days ago
there's only a handful of entities in the world that are capable of peering. Most people have to pay for their bandwidth.
charcircuit
6 days ago
It can be done by whoever is providing you internet connectivity. Not everywhere adds extra charges for bandwidth.
spookie
6 days ago
A friend of mine had over 1000 requests/sec on his Gitea at peaks. Also, you aren't taking into account some of us don't have a "server", just some shitbox computer in the basement.
This isn't about a mere dozen requests. It gets pretty bad. It also slows down his life.
jdboyd
5 days ago
The "shitbox computer in the basement" is something I would call a server. I mean, it is more capable than most VPSs (except in upload speed to the Internet).
vladvasiliu
6 days ago
I sympathize with the general gist of your post, but I've seen many a bot generate more traffic than legitimate users on our site.
Never had any actual performance issue, but I can see why a site that generally expects very low traffic might freak out. Could they better optimize their sites? Probably; I know ours sucks big time. But in the era of autoscaling workloads on someone else's computer, a misconfigured site could rack up a big-ass bill.
eGQjxkKF6fif
6 days ago
It's not 'fuck the bots', it's fuck the bot owners for using websites however they want without, at minimum, asking. Like, 'hey, cool if I use this tool to interact with your site for this and that reason?'
No, they just do it, so they can scrape data. AI at this point has hit the cap on what it can consume knowledge-wise, so live updates and new information are what's most valuable to them.
So they find tricky, evil ways to hammer resources that we as site operators own, using our data for their profit, their success, their benefit, while blatantly saying 'screw you' as they ignore robots.txt or pretend to be legitimate users.
There's a digital battle going on. Clients are coming in posing as real users, via residential IP lists like the ones from https://infatica.io/
A writeup posted to HN about it
https://jan.wildeboer.net/2025/04/Web-is-Broken-Botnet-Part-...
System and site operators have every right to build the tools they want to protect their systems and data, and to give their audiences a user experience that benefits them.
Your points are valid and make sense, but it's not about that. It's about valuing authentic work and intellectual property, and some dweeb who wants to steal it doesn't get to just run their bots against other people's resources at those people's detriment and their own benefit.
eadmund
6 days ago
> Like 'hey cool if I use this tool to interact with your site for this and that reason?'
They do ask: they make an HTTP request. How the server responds to that request is up to the owner. As in the article, the owner can decide to respond to that request however he likes.
I think that a big part of the issue is that software is not well-written. If you think about it, even the bots constantly requesting tarballs for git commits don’t have to destroy the experience of using the system for logged-in users. One can easily imagine software which prioritises handling requests for authorised users ahead of those for anonymous ones. One can easily imagine software which rejects incoming anonymous requests when it is saturated. But that’s hard to write, and our current networks, operating systems, languages and frameworks make that more difficult than it has to be.
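One can even approximate this at the reverse proxy, without touching the application. A rough nginx sketch (the 'session' cookie name and backend port are assumptions, and rate-limiting is only a crude stand-in for true saturation-based shedding):

    # goes inside the http {} block of nginx.conf
    # requests without a session cookie get rate-limited; logged-in users never do
    map $cookie_session $anon_key {
        ""      $binary_remote_addr;   # anonymous -> key the limit on client IP
        default "";                    # empty key -> exempt from limit_req
    }

    limit_req_zone $anon_key zone=anon:10m rate=10r/s;

    server {
        listen 80;

        location / {
            limit_req        zone=anon burst=20 nodelay;
            limit_req_status 429;                     # shed anonymous load politely
            proxy_pass       http://127.0.0.1:3000;   # e.g. a Gitea/forge backend
        }
    }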
const_cast
6 days ago
Kind of, but they lie in the HTTP request - their user agent isn't truthful, they don't disclose that they're a bot, they try to mimic other traffic as a disguise, they spread across many different IPs so they can't easily be blocked, etc.
It's kind of like me asking to borrow your car to go to work and then I take your car and ship it overseas.
polotics
6 days ago
Oh yeah, I was horrified recently starting up a "smart TV" and going through the installable apps to find a lot of repackaged YouTube content, even from creators I like, e.g. a chess channel. The app just provides the same content as the YouTube channel does, but at the bottom of the long free-to-use license agreement there is a weirdly worded clause saying you grant the app the right to act as a proxy for partner traffic... So many smart TV users are unwittingly providing residential IPs for the app developer to rent out.
eGQjxkKF6fif
5 days ago
Yeah, it's a disgrace. 'bUt YoU AgReeD tO iT So I HaVe The RIGht To Do ThIS' it's just cyber warfare.
Plain and simple.
polotics
3 days ago
Sorry, I had forgotten who it was. Now time to name and shame: the culprit calls itself https://brightdata.com/ Also, LG relying on the developer's own disclosure for what they call "Data Safety" is really poor: "There is no relevant information provided by the developer" is all the app reports in the LG app store... Also no way to rate or report the app; I only found a mention that I should report this to lgappsreport@lge.com
nickpsecurity
6 days ago
Some of us have little money or optimized for something else. I spent a good chunk of this and last year with hardly any groceries. So, even $30 a month in hosting and CDN costs was large.
Another situation is an expensive resource. This might be bandwidth hogs, CPU-heavy operations, or higher per-CPU licensing costs in databases. Some people's sites or services don't scale well or hit their budget limits fast.
In a high-security setup, those boxes usually have limited performance. It comes from the runtime checks, context switches, or embedded/soft processors. To rule out timing channels, one might have to disable shared caches, too.
Those systems run slow enough that whatever is in front usually needs to throttle the traffic. We'd want no wasted traffic given their cost ranges from $2,000 a chip (FPGA) to six digits a system (e.g. XTS-500 w/ STOP OS). One could say the same if it was a custom or open-source chip, like Plasma MIPS.
Many people might be in the poor category. A significant number are in the low-scalability category. The others are rare but significant.
rozap
6 days ago
He said in the article there were requests that made a tarball of the entire repository for each SHA in the git tree. No matter how you slice it, that's pretty ugly.
Sure, you could require any number of (user-hostile) interactions (logins, captchas, etc.) to do expensive operations like that, but then the usability is compromised, which sucks.
Dylan16807
6 days ago
> 20 requests per second is just... Nothing.
Unless you're running mediawiki.
Are there easy settings I should be messing with for that?
haiku2077
5 days ago
Do what the Arch Wiki did and install https://anubis.techaro.lol/
rnmg
6 days ago
Can you expand on the better ways of handling bots? Genuinely curious.
layer8
6 days ago
He’s saying that a modern web server setup should be able to handle the traffic without any bot-specific handling. Personally, I don’t follow.
haiku2077
6 days ago
My server can handle it, my ISP cannot!
horsawlarway
3 days ago
My biggest recommendation is to just get familiar with the caching constructs that are available. I understand folks think CDNs are complicated and expensive, but they're honestly incredibly cheap and relatively easy to use.
99.9% of the time, just serving static content with a good Cache-Control header will solve the issue. If you have a restrictive ISP, use a CDN to do it for you for cheap.
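For the static case it really is just a header. A minimal nginx sketch (paths and numbers are arbitrary):

    server {
        listen 80;
        root   /srv/www;

        # let a CDN or the browser absorb repeat hits on heavy static assets
        location /downloads/ {
            add_header Cache-Control "public, max-age=86400";
            try_files $uri =404;
        }
    }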
The more involved recommendation is trimming out features of hosted apps that aren't all that useful and are causing problems. A simple example of what I mean...
---
The author here is noting that his Gitea instance is seeing huge load from 20r/s, which just isn't reasonable (I actually host a Gitea instance myself and I know it can handle 10 times this traffic, even when running on a raspberry pi). So why is his failing?
Well - it sounds like he's letting bots hit every url of a public instance. Not the choice I'd make, but also not unreasonable, hosting public things is fine.
Buuut - it also sounds like he's left the "Download archive" button enabled on the instance.
That's not a good call. It's a feature that's used very rarely by real humans, but it's a tripmine for any bot scanning the site, triggering high load and network traffic.
Want a 5 second solution to the problem? Set `DISABLE_DOWNLOAD_SOURCE_ARCHIVES` in the Gitea config (see https://docs.gitea.com/administration/config-cheat-sheet). Problem solved. Bots are no longer an issue. They are welcome to scan and not cause problems anymore.
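If I'm remembering the cheat sheet right, it's one line in app.ini (section name per the cheat sheet linked above):

    ; app.ini
    [repository]
    DISABLE_DOWNLOAD_SOURCE_ARCHIVES = true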
What if your app doesn't have an easy config option? Nginx will happily help, with far less complexity and frustration than trying to blindly play whack-a-mole with IP addresses (that approach is terrible and does not work... period).
Configure nginx with a specific path rule that either blocks requests to that path entirely, or places it behind basic auth (it doesn't need to be clever, and you don't even need to make it secret - hell, put the basic auth user/pass directly in the repo's readme, or show it on your site). The bots won't hit it anymore.
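Something along these lines, as a sketch (the archive URL pattern is Gitea's, from memory; adjust for whatever app you're fronting, and these blocks go inside your server {}):

    # option 1: refuse source-archive downloads outright
    location ~ ^/[^/]+/[^/]+/archive/ {
        return 403;
    }

    # option 2: same path, but behind deliberately non-secret basic auth
    # location ~ ^/[^/]+/[^/]+/archive/ {
    #     auth_basic           "Archives";
    #     auth_basic_user_file /etc/nginx/htpasswd;  # htpasswd -bc /etc/nginx/htpasswd demo demo
    #     proxy_pass           http://127.0.0.1:3000;
    # }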
---
So I guess what I'm saying is: if you're finding that bots are causing a problem on your sites, consider just treating them like users and solving the problem, instead of going mad trying to remove the bots.
Be constructive instead of destructive.
Ultimately, a lot of those bots are scanning that content to show to users, many are even doing it directly at the request of the user currently interacting with them.
Falling into the trap of the "fuck the bots" mindset is a sure way to lose (although it can feel good emotionally). It's not understanding the problem, it's not solving the problem, and it's limiting access to a thing you intentionally made public. Users are on the other end of those bots.
He's choosing to play the "everyone loses" square of the prisoner's dilemma.
lingo334
2 days ago
> My biggest recommendation is to just get familiar with the caching constructs that are available.
That doesn't help. They request seemingly random resources in seemingly random order. While they do often hit some links multiple times, it's usually too few and far between for caching to be of any help.
As to the rest, "Just turn features off. No one uses them, trust me bro!"
ThePinion
6 days ago
Can you further elaborate on this robots.txt setup? I was under the impression most AI crawlers just completely ignore robots.txt, so you may just be hitting the ones that are at least attempting to obey it?
I'm not against the idea like others here seem to be, I'm more curious about implementing it without harming good actors.
kevindamm
6 days ago
If your robots.txt has a line specifying, for example
Disallow: /private/beware.zip
and you have no links to that file from elsewhere on the site, then any request for that URL means someone or something read the robots.txt and explicitly violated it, and you can send it a zipbomb or ban the source IP or whatever.
But in my experience it isn't the robots.txt violations that are so flagrant (half the requests are probably humans who were curious what you're hiding, and most bots written specifically for LLMs don't even check robots.txt). The real abuse is the crawler that hits an expensive and frequently-changing URL more often than is reasonable, and the card-testers hitting payment endpoints, sometimes with excessive chargebacks. And port-scanners, but those are a minor annoyance if your network setup is decent. And email spoofers who drag your server's reputation down if you don't set things up correctly early on and whenever you change hosts.
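To make the bait part concrete, a minimal nginx fragment (the bomb file is an assumption, e.g. a large run of zeros gzipped down to a few KB; actual IP banning would need fail2ban or similar on top):

    # inside the server {} block
    location = /private/beware.zip {
        types        { }                      # ignore the .gz extension...
        default_type application/zip;         # ...and claim it's the advertised zip
        add_header   Content-Encoding gzip;   # clients honoring this will try to inflate it
        alias        /srv/traps/bomb.gz;
    }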
p3rls
6 days ago
I run one of the largest wikis in my niche, and convincing the other people on my dev team to use gzip bombs as a defensive measure has been impossible -- they are convinced that it is a dangerous liability (EU-brained) and isn't worth pursuing.
Do you guys really use these things on real public-facing websites?
pdimitar
6 days ago
Very curious if a bad actor can sue you if you serve them a zip bomb from an EU server. Got any links?