miladyincontrol
7 days ago
A lot of less scrupulous crawlers just seem to imitate the big ones. I feel like a lot of people make assumptions based on the user agent, because it has to be true, right?
My fave method is still just to have a bait path in robots.txt that gzip bombs and autoblocks all further requests from whatever touches it. Was real easy to configure in Caddy and tends to catch the worst offenders.
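Roughly the shape of it, as a sketch rather than my exact config (the bait path, bomb file and domain are made up, and actual autoblocking needs something extra on top, e.g. fail2ban tailing the access log):

    # robots.txt advertises a path no human or honest bot should ever touch:
    #   User-agent: *
    #   Disallow: /juicy-secrets/
    example.com {
        root * /srv/site

        # anything poking the bait path gets a pre-built gzip bomb instead
        @bait path /juicy-secrets/*
        handle @bait {
            root * /srv/traps
            rewrite * /bomb.gz
            header Content-Encoding gzip
            file_server
        }

        handle {
            file_server
        }
    }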
Not excusing the bot behaviours but if a few bots blindly take down your site, then an intentionally malicious offender would have a field day.
horsawlarway
6 days ago
Your last comment feels pretty apt.
Maybe I'm just a different generation than the folks writing these blog posts, but I really don't understand the fixation on such low resource usage.
It's like watching a grandparent freak out over not turning off an LED light or seeing them drive 15 miles to save 5c/gallon on gas.
20 requests per second is just... Nothing.
Even if you're dynamically generating them all (and seriously... Why? Time would have been so much better spent fixing that with some caching than this effort) it's just not much demand.
I get the "fuck the bots" style posts are popular in the Zeitgeist at the moment, but this is hardly novel.
There are a lot more productive ways to handle this that waste a lot less of your time.
whoisyc
6 days ago
1. I fear you may be underestimating the volume of bot traffic websites are now receiving. I recommend reading this article to get an idea of the scale of the problem: https://thelibre.news/foss-infrastructure-is-under-attack-by...
2. Not all requests are created equal. 20 requests a second for the same static HTML file? No problem. But if you have, say, a release page for an open source project with binary download links for all past versions for multiple platforms, each one being a multi megabyte blob, and a scraper starts hitting these links, you will run into bandwidth problems very quickly, unless you live in a utopia where bandwidth is free.
3. You are underestimating the difficulty of caching dynamic pages. Cache invalidation is hard, they say. One notably problematic example is Git blame. So far I am not aware of any existing solution for caching blame output, and jury-rigging your own will likely not be any easier than the “solution” explored in TFA.
hartator
6 days ago
> 2. Not all requests are created equal. 20 requests a second for the same static HTML file? No problem. But if you have, say, a release page for an open source project with binary download links for all past versions for multiple platforms, each one being a multi megabyte blob, and a scraper starts hitting these links, you will run into bandwidth problems very quickly, unless you live in a utopia where bandwidth is free.
All of this is (and should be) cached on a CDN. You can do 1,000 QPS on this in that config.
tdeck
6 days ago
I think a person should be able to set up a small website on a niche topic without ever having to set up a CDN. Until very recently this was the case, so it's sad to see that simplicity go away purely to satisfy the data needs of shady scrapers.
busymom0
6 days ago
Shouldn't such big blobs be put on something like Cloudflare R2 or Backblaze or even S3 with caching in front, instead of having your server handle such file downloads?
phantomathkg
6 days ago
Caching also costs money. Nothing is free.
charcircuit
6 days ago
>you live in a utopia where bandwidth is free.
It's called peering agreements and they are very common. There's a reason social media and sites like YouTube, Twitch, TikTok don't immediately go out of business. The bandwidth is free for most users.
chii
6 days ago
there's only a handful of entities in the world that are capable of peering. Most people have to pay for their bandwidth.
charcircuit
6 days ago
It can be done by whoever is providing you internet connectivity. Not everywhere adds extra charges for bandwidth.
spookie
6 days ago
A friend of mine had over 1000 requests/sec on his Gitea at peaks. Also, you aren't taking into account some of us don't have a "server", just some shitbox computer in the basement.
This isn't about a mere dozen requests. It gets pretty bad. It also slows down his life.
jdboyd
5 days ago
The "shitbox computer in the basement" is something I would call a server. I mean, it is more capable than most VPSs (except in upload speed to the Internet).
vladvasiliu
6 days ago
I sympathize with the general gist of your post, but I've seen many a bot generate more traffic than legitimate users on our site.
Never had any actual performance issue, but I can see why a site that generally expects very low traffic might freak out. Could they better optimize their sites? Probably; I know ours sucks big time. But in the era of autoscaling workloads on someone else's computer, a misconfigured site could rack up a big-ass bill.
eGQjxkKF6fif
6 days ago
It's not 'fuck the bots', it's fuck the bot owners for using websites however they want without, at minimum, asking. Like, 'hey, cool if I use this tool to interact with your site for this and that reason?'
No, they just do it, so they can scrape data. AI at this point has hit the cap on what it can consume knowledge-wise, so live updates and new information are what's most valuable to them.
So they find tricky, evil ways to hammer resources that we as site operators own, using our data for their profit, their success, their benefit, while blatantly saying 'screw you' as they ignore robots.txt or pretend to be legitimate users.
There's a digital battle going on. Clients are coming in posing as real users, via residential IP lists like the ones from https://infatica.io/
A writeup posted to HN about it
https://jan.wildeboer.net/2025/04/Web-is-Broken-Botnet-Part-...
System and site operators have every right to build the tools they want to protect their systems and data, and to give their audiences a user experience that benefits them.
Your points are valid and make sense, but it's not about that. It's about valuing authentic work and intellectual property, and some dweeb who wants to steal it doesn't get to just run their bots against other people's resources at those people's detriment and their own benefit.
eadmund
6 days ago
> Like 'hey cool if I use this tool to interact with your site for this and that reason?'
They do ask: they make an HTTP request. How the server responds to that request is up to the owner. As in the article, the owner can decide to respond to that request however he likes.
I think that a big part of the issue is that software is not well-written. If you think about it, even the bots constantly requesting tarballs for git commits don’t have to destroy the experience of using the system for logged-in users. One can easily imagine software which prioritises handling requests for authorised users ahead of those for anonymous ones. One can easily imagine software which rejects incoming anonymous requests when it is saturated. But that’s hard to write, and our current networks, operating systems, languages and frameworks make that more difficult than it has to be.
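One can even approximate this at the reverse proxy, without touching the application. A rough nginx sketch (the 'session' cookie name and backend port are assumptions, and rate-limiting is only a crude stand-in for true saturation-based shedding):

    # goes inside the http {} block of nginx.conf
    # requests without a session cookie get rate-limited; logged-in users never do
    map $cookie_session $anon_key {
        ""      $binary_remote_addr;   # anonymous -> key the limit on client IP
        default "";                    # empty key -> exempt from limit_req
    }

    limit_req_zone $anon_key zone=anon:10m rate=10r/s;

    server {
        listen 80;

        location / {
            limit_req        zone=anon burst=20 nodelay;
            limit_req_status 429;                     # shed anonymous load politely
            proxy_pass       http://127.0.0.1:3000;   # e.g. a Gitea/forge backend
        }
    }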
const_cast
6 days ago
Kind of, but they lie in the HTTP request - their user agent isn't truthful, they don't disclose that they're a bot, they try to mimic other traffic as a disguise, they spread across many different IPs so they can't easily be blocked, etc.
It's kind of like me asking to borrow your car to go to work and then I take your car and ship it overseas.
polotics
6 days ago
Oh yeah, I was horrified recently starting up a "smart TV" and going through the installable apps to find a lot of repackaged YouTube content, even from creators I like, e.g. a chess channel. The app just provides the same content as the YouTube channel does, but at the bottom of the long free-to-use license agreement there is a weirdly worded clause saying you grant the app the right to act as a proxy for partner traffic... So many smart TV users are unwittingly providing residential IPs for the app developer to rent out.
eGQjxkKF6fif
5 days ago
Yeah, it's a disgrace. 'bUt YoU AgReeD tO iT So I HaVe The RIGht To Do ThIS' it's just cyber warfare.
Plain and simple.
polotics
3 days ago
Sorry, I had forgotten who it was. Now time to name and shame: the culprit calls itself https://brightdata.com/ Also, LG relying on the developer's own disclosure for what they call "Data Safety" is really poor: "There is no relevant information provided by the developer" is all the app reports in the LG app store... Also no way to rate or report the app; I only found a mention that I should report this to lgappsreport@lge.com
nickpsecurity
6 days ago
Some of us have little money or optimized for something else. I spent a good chunk of this and last year with hardly any groceries. So, even $30 a month in hosting and CDN costs was large.
Another situation is an expensive resource. This might be bandwidth hogs, CPU-heavy operations, or higher per-CPU licensing costs in databases. Some people's sites or services don't scale well or hit their budget limits fast.
In a high-security setup, those boxes usually have limited performance. It comes from the runtime checks, context switches, or embedded/soft processors. To rule out timing channels, one might have to disable shared caches, too.
Those systems run slow enough that whatever is in front usually needs to throttle the traffic. We'd want no wasted traffic given their cost ranges from $2,000 a chip (FPGA) to six digits a system (e.g. XTS-500 w/ STOP OS). One could say the same if it was a custom or open-source chip, like Plasma MIPS.
Many people might be in the poor category. A significant number are in the low-scalability category. The others are rare but significant.
rozap
6 days ago
He said in the article there were requests that made a tarball of the entire repository for each SHA in the git tree. No matter how you slice it, that's pretty ugly.
Sure, you could require any number of (user-hostile) interactions (logins, captchas, etc.) to do expensive operations like that, but then the usability is compromised, which sucks.
Dylan16807
6 days ago
> 20 requests per second is just... Nothing.
Unless you're running mediawiki.
Are there easy settings I should be messing with for that?
haiku2077
5 days ago
Do what the Arch Wiki did and install https://anubis.techaro.lol/
rnmg
6 days ago
Can you expand on the better ways of handling bots? Genuinely curious.
layer8
6 days ago
He’s saying that a modern web server setup should be able to handle the traffic without any bot-specific handling. Personally, I don’t follow.
haiku2077
6 days ago
My server can handle it, my ISP cannot!
horsawlarway
3 days ago
My biggest recommendation is to just get familiar with the caching constructs that are available. I understand folks think CDNs are complicated and expensive, but they're honestly incredibly cheap and relatively easy to use.
99.9% of the time, just serving static content with a good Cache-Control header will solve the issue. If you have a restrictive ISP, use a CDN to do it for you for cheap.
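For the static case it really is just a header. A minimal nginx sketch (paths and numbers are arbitrary):

    server {
        listen 80;
        root   /srv/www;

        # let a CDN or the browser absorb repeat hits on heavy static assets
        location /downloads/ {
            add_header Cache-Control "public, max-age=86400";
            try_files $uri =404;
        }
    }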
The more involved recommendation is trimming out features of hosted apps that aren't all that useful and are causing problems. A simple example of what I mean...
---
The author here is noting that his Gitea instance is seeing huge load from 20r/s, which just isn't reasonable (I actually host a Gitea instance myself and I know it can handle 10 times this traffic, even when running on a raspberry pi). So why is his failing?
Well - it sounds like he's letting bots hit every url of a public instance. Not the choice I'd make, but also not unreasonable, hosting public things is fine.
Buuut - it also sounds like he's left the "Download archive" button enabled on the instance.
That's not a good call. It's a feature that's used very rarely by real humans, but it's a tripmine for any bot scanning the site, triggering high load and network traffic.
Want a 5 second solution to the problem? Set `DISABLE_DOWNLOAD_SOURCE_ARCHIVES` in the Gitea config (see https://docs.gitea.com/administration/config-cheat-sheet). Problem solved. Bots are no longer an issue. They are welcome to scan and not cause problems anymore.
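If I'm remembering the cheat sheet right, it's one line in app.ini (section name per the cheat sheet linked above):

    ; app.ini
    [repository]
    DISABLE_DOWNLOAD_SOURCE_ARCHIVES = true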
What if your app doesn't have an easy config option? Nginx will happily help, with far less complexity and frustration than trying to blindly play whack-a-mole with IP addresses (that approach is terrible and does not work... period).
Configure nginx with a specific path rule that either blocks requests to that path entirely, or places it behind basic auth (it doesn't need to be clever, and you don't even need to make it secret - hell, put the basic auth user/pass directly in the repo's readme, or show it on your site). The bots won't hit it anymore.
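Something along these lines, as a sketch (the archive URL pattern is Gitea's, from memory; adjust for whatever app you're fronting, and these blocks go inside your server {}):

    # option 1: refuse source-archive downloads outright
    location ~ ^/[^/]+/[^/]+/archive/ {
        return 403;
    }

    # option 2: same path, but behind deliberately non-secret basic auth
    # location ~ ^/[^/]+/[^/]+/archive/ {
    #     auth_basic           "Archives";
    #     auth_basic_user_file /etc/nginx/htpasswd;  # htpasswd -bc /etc/nginx/htpasswd demo demo
    #     proxy_pass           http://127.0.0.1:3000;
    # }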
---
So I guess what I'm saying is: if you're finding that bots are causing a problem on your sites, consider just treating them like users and solving the problem, instead of going mad trying to remove the bots.
Be constructive instead of destructive.
Ultimately, a lot of those bots are scanning that content to show to users, many are even doing it directly at the request of the user currently interacting with them.
Falling into the trap of the "fuck the bots" mindset is a sure way to lose (although it can feel good emotionally). It's not understanding the problem, it's not solving the problem, and it's limiting access to a thing you intentionally made public. Users are on the other end of those bots.
He's choosing to play the "everyone loses" square of the prisoner's dilemma.
lingo334
2 days ago
> My biggest recommendation is to just get familiar with the caching constructs that are available.
That doesn't help. They request seemingly random resources in seemingly random order. While they do often hit some links multiple times, it's usually too few and far between for caching to be of any help.
As to the rest, "Just turn features off. No one uses them, trust me bro!"
ThePinion
6 days ago
Can you further elaborate on this robots.txt setup? I was under the impression most AI crawlers just completely ignore robots.txt, so you may just be hitting the ones that are at least attempting to obey it?
I'm not against the idea like others here seem to be, I'm more curious about implementing it without harming good actors.
kevindamm
6 days ago
If your robots.txt has a line specifying, for example
Disallow: /private/beware.zip
and you have no links to that file from elsewhere on the site, then any request for that URL means someone or something read the robots.txt and explicitly violated it, and you can send it a zipbomb or ban the source IP or whatever.
But in my experience it isn't the robots.txt violations that are so flagrant (half the requests are probably humans who were curious what you're hiding, and most bots written specifically for LLMs don't even check robots.txt). The real abuse is the crawler that hits an expensive and frequently-changing URL more often than is reasonable, and the card-testers hitting payment endpoints, sometimes with excessive chargebacks. And port-scanners, but those are a minor annoyance if your network setup is decent. And email spoofers who drag your server's reputation down if you don't set things up correctly early on and whenever you change hosts.
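To make the bait part concrete, a minimal nginx fragment (the bomb file is an assumption, e.g. a large run of zeros gzipped down to a few KB; actual IP banning would need fail2ban or similar on top):

    # inside the server {} block
    location = /private/beware.zip {
        types        { }                      # ignore the .gz extension...
        default_type application/zip;         # ...and claim it's the advertised zip
        add_header   Content-Encoding gzip;   # clients honoring this will try to inflate it
        alias        /srv/traps/bomb.gz;
    }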
p3rls
6 days ago
I run one of the largest wikis in my niche, and convincing the other people on my dev team to use gzip bombs as a defensive measure has been impossible -- they are convinced that it is a dangerous liability (EU-brained) and isn't worth pursuing.
Do you guys really use these things on real public-facing websites?
pdimitar
6 days ago
Very curious if a bad actor can sue you if you serve them a zip bomb from an EU server. Got any links?