fancyfredbot
10 hours ago
Who are these aggressive scrapers run by?
It is difficult to figure out the incentives here. Why would anyone want to pull data from LWN (or any other site) at a rate that amounts to a DDoS-like attack?
If I run a big, data-hungry AI lab consuming training data at 100Gb/s, it's much much easier to scrape 10,000 sites at 10Mb/s than to DDoS a smaller number of sites with more traffic. Of course the big labs want this data, but why would they risk the reputational damage of overloading popular sites in order to pull it in an hour instead of a day or two?
ccgreg
a minute ago
One way to figure that out is to look at which companies claim to have foundation models, but no one knows what their crawler is named.
I also suspect that there are a bunch of sub-contractors involved, working for companies that don't supervise them very carefully.
overfeed
9 hours ago
> If I run a big data hungry AI lab consuming training data at 100Gb/s it's much much easier to...
You are incorrectly assuming competency, thoughtful engineering and/or some modicum of care for negative externalities. The scraper may have been whipped up by AI, and shipped an hour later after a quick 15-minute test against en.wikipedia.org.
Whoever the perpetrator is, they are hiding behind "residential IP providers", so there's no reputational risk. Further, AI companies already have a reputation for engaging in distasteful practices, but popular wisdom claims that they make up for the awfulness with utility, so even if it turns out to be a big org like OpenAI or Anthropic, people will shrug their shoulders and move on.
fancyfredbot
8 hours ago
Yes I agree it's more likely incompetence than malice. That's another reason I don't think it's a lab. Even if you don't like the big labs you can probably admit they are reasonably smart/competent.
Residential IP providers definitely don't remove reputational risk. There are many ways people can find out what you are doing. The main one being that your employees might decide to tell on you.
The IP providers are a great way of getting around Cloudflare etc. They are also reasonably expensive! I find it very plausible that these IP providers are involved, but I still don't understand who is paying them.
jacobgkau
8 hours ago
This is just an anecdote, but having been dealing with similar problems on one of my websites for the past year or so: I was seeing a huge number of hits from different residential IP addresses (mostly Latin American) at the same time, once every 5-10 minutes, which started crashing my site regularly. Digging through my server's logs and watching them in real time, I noticed one or two Huawei IPs making requests at the same time as the dozens or hundreds of residential IPs. Blocking the Huawei IPs seemed to mysteriously cut back the residential IP requests, at least for a short amount of time (i.e. a couple of hours).
This isn't to say every attack that looks similar is being done by Huawei (which I can't say for certain, anyway). But to me, it does look an awful lot like even large organizations you'd think would be competent can stoop to these levels. I don't have an answer for you as to why.
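For anyone wanting to look for the same pattern, this is roughly the correlation I was doing by eyeball - a minimal sketch, where the log path, the burst threshold and the "suspect" prefix are all placeholders rather than my real setup:

    import re
    from collections import defaultdict
    from datetime import datetime

    # Sketch: bucket requests by minute, flag bursts of distinct client IPs,
    # and note whether a suspected "coordinator" range appears in the same
    # minute. Log path, threshold and suspect prefix are placeholders.
    LOG = "/var/log/apache2/access.log"
    SUSPECT_PREFIX = "203.0.113."                      # hypothetical range
    LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')  # client IP and timestamp

    per_minute = defaultdict(set)   # minute -> distinct client IPs
    suspect_minutes = set()         # minutes in which the suspect range shows up

    with open(LOG) as f:
        for line in f:
            m = LINE.match(line)
            if not m:
                continue
            ip, ts = m.group(1), m.group(2)
            # default Apache timestamp: 10/Oct/2025:13:55:36 +0000
            minute = datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z").strftime("%Y-%m-%d %H:%M")
            per_minute[minute].add(ip)
            if ip.startswith(SUSPECT_PREFIX):
                suspect_minutes.add(minute)

    for minute, ips in sorted(per_minute.items()):
        if len(ips) > 100:          # arbitrary burst threshold
            note = "<-- suspect range present" if minute in suspect_minutes else ""
            print(minute, len(ips), "distinct IPs", note)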
dannyobrien
9 hours ago
I've been asking this for a while, especially as a lot of the early blame went on the big, visible US companies like OpenAI and Anthropic. While their incentives are different from search engines' (as someone said early on in this onslaught, "a search engine needs your site to stay up; an AI company doesn't"), that's quite a subtle incentive difference. Just avoiding the blocks that inevitably spring up when you misbehave is an incentive the other way -- and probably the biggest reason robots.txt obedience, delays between accesses, back-off algorithms etc. are widespread. We have a culture that conveys all of these approaches, and reciprocity has its part, but I suspect that's part of the encouragement to adopt them. It could be that they're in too much of a hurry to follow the rules, or it could be others hiding behind those bot names (or others). Unsure.
Anyway, I think the currently small[1] but growing problem is going to be individuals using AI agents to access web pages. I think this falls under the category of traffic that people are concerned about, even though it's under an individual user's control, and those users are ultimately accessing that information (though perhaps without seeing the ads that pay for it). AI agents are frequently zooming off and collecting hundreds of citations for an individual user, in the time that a user agent under manual control of a human would click on a few links. Even if those links aren't all accessed, that's going to change the pattern of organic browsing for websites.
Another challenge is that with tools like Claude Cowork, users are increasingly going to be able to create their own, one-off, crawlers. I've had a couple of occasions when I've ended up crafting a crawler to answer a question, and I've had to intervene and explicitly tell Claude to "be polite", before it would build in time-delays and the like (I got temporarily blocked by NASA because I hadn't noticed Claude was hammering a 404 page).
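For concreteness, the kind of "polite" behaviour I end up having to ask for explicitly looks roughly like this - a sketch, not what Claude actually produced, and the site, user agent and delays are made up:

    import time
    import urllib.robotparser

    import requests

    # Rough sketch of "polite" crawling: honour robots.txt, identify yourself,
    # wait between requests, and back off on errors instead of hammering the
    # same URL. The site, user agent and delays below are made up.
    BASE = "https://example.org"
    USER_AGENT = "my-one-off-research-crawler (contact: me@example.org)"

    rp = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
    rp.read()

    def polite_get(path, delay=5.0, max_retries=3):
        url = BASE + path
        if not rp.can_fetch(USER_AGENT, url):
            return None                              # robots.txt says no: skip it
        for attempt in range(max_retries):
            resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
            if resp.status_code == 200:
                time.sleep(delay)                    # fixed pause between fetches
                return resp.text
            if resp.status_code in (404, 410):
                return None                          # page is gone: don't retry it
            time.sleep(delay * 2 ** attempt)         # back off on 429/5xx and retry
        return None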
The Web was always designed to be readable by humans and machines, so I don't see a fundamental problem now that end users have more capability to work with machines to learn what they need. But even if we track down and successfully discourage bad actors, we need to work out how to adapt to the changing patterns of how good actors, empowered by better access to computation, can browse the web.
[1] - https://radar.cloudflare.com/ai-insights#ai-bot-crawler-traf...
dannyobrien
9 hours ago
(and if anyone from Anthropic or OpenAI is reading this: teach your models to be polite when they write crawlers! It's actually an interesting alignment issue that they don't consider the externalities of their actions right now!)
pstuart
9 hours ago
Hell, they should at least be caching those requests rather than hitting the endpoint on every single AI request that needs the info.
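Even a crude in-memory cache in front of the fetches would cut most of the repeat traffic - a minimal sketch, with the one-hour TTL pulled out of thin air:

    import time

    import requests

    # Crude TTL cache: repeat requests within an hour are served from memory
    # instead of re-hitting the origin. The TTL is an arbitrary assumption.
    _cache = {}          # url -> (fetched_at, body)
    TTL = 3600           # one hour

    def cached_get(url):
        now = time.time()
        hit = _cache.get(url)
        if hit and now - hit[0] < TTL:
            return hit[1]
        body = requests.get(url, timeout=30).text
        _cache[url] = (now, body)
        return body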
philipkglass
10 hours ago
I don't think that most of them are from big-name companies. I run a personal web site that has been periodically overwhelmed by scrapers, prompting me to update my robots.txt with more disallows.
The only big AI company I recognized by name was OpenAI's GPTBot. Most of them are from small companies that I'm only hearing of for the first time when I look at their user agents in the Apache logs. Probably the shadiest organizations aren't even identifying their requests with a unique user agent.
As for why a lot of dumb bots are interested in my web pages now, when they're already available through Common Crawl, I don't know.
iamnothere
10 hours ago
Maybe someone is putting out public “scraper lists” that small companies or even individuals can use to find potentially useful targets, perhaps with some common scraper tool they are using? That could explain it? I am also mystified by this.
velox_neb
10 hours ago
I bet some guy just told Claude Code to archive all of LWN for him on a whim.
tux3
9 hours ago
Some guy doesn't show up with 10k residential IPs. This is deliberate and organized.
slicerdicer2
8 hours ago
There are multiple Israeli companies who will provide you with millions of residential proxies at a per-GB usage rate and a very easy API. You can set this up in minutes with Claude Code.
fancyfredbot
7 hours ago
These IP providers aren't cheap (cost per GB seems to be $4 but there are bulk discounts). The cost to grab all of LWN isn't prohibitively high for an individual but it's enough that most people probably wouldn't do it on a whim.
I suppose it only needs one person though. So it's probably a pretty plausible explanation.
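Back-of-the-envelope, where only the $4/GB rate comes from the provider pricing and the archive size is purely a guess on my part:

    # Only the $4/GB rate is quoted; the archive size is a made-up illustration.
    price_per_gb = 4.0      # residential proxy rate, USD per GB
    archive_size_gb = 20    # hypothetical size of the HTML you'd pull
    print(f"roughly ${price_per_gb * archive_size_gb:.0f} to pull it all once")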
kleene_op
9 hours ago
LLMs just do be paperclipping
chrisjj
9 hours ago
Can Claude Code even do that? Rather than provide code to do that.
suburban_strike
6 hours ago
When faced with evidence of how the malicious actually operate, we forever take them at their word when they insist they're just incompetent.
The spirit of this site is so dead. Where are the hackers? Scraping is the best anyone is coming up with?
It's not scraping. They'd notice themselves getting banned everywhere for abuse of this magnitude, which is counterproductive to scraping goals. Rather than rate-limit the queries to avoid that attention, they're going out of their way to (pay to?) route traffic through a residential botnet so they can sustain it. This is not by accident, nor a byproduct of sloppy code Claude shat out. Someone wants to operate with this degree of aggressiveness, and they do not want to be detected or stopped.
This setup is as close to real-time surveillance as you can get. Someone really wants to know what is being published on target sites, with as short a refresh interval as possible and zero interference. It's not a western governmental entity or they'd just tap it.
As for who...there's only one group on the planet so obsessed with monitoring and policing everything everyone else is doing.
ofrzeta
4 hours ago
Recently I needed to block some scrapers due to excessive load on a server, and here are some that I identified:
BOTS=( "semrushbot" "petalbot" "aliyunsecbot" "amazonbot" "claudebot" "thinkbot" "perplexitybot" "openai.com/bot" )
This was really just emergency blocking and it included more than 1500 IP addresses.
Here's Amazon's page about their bot, with more information including IP addresses.
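The blocking itself was nothing clever - roughly this kind of one-off script to pull the offending addresses out of the access log (the log path and format are assumptions about my setup; the real run produced the 1500+ addresses mentioned above):

    import re

    # Sketch of the emergency blocking: collect the client IP of every request
    # whose user agent matches one of those bot names, then feed the list to
    # the firewall by hand. Log path and format are assumptions about my setup.
    BOTS = ["semrushbot", "petalbot", "aliyunsecbot", "amazonbot",
            "claudebot", "thinkbot", "perplexitybot", "openai.com/bot"]
    pattern = re.compile("|".join(re.escape(b) for b in BOTS), re.IGNORECASE)

    offenders = set()
    with open("/var/log/apache2/access.log") as f:
        for line in f:
            if pattern.search(line):
                offenders.add(line.split()[0])   # first field is the client IP

    print(len(offenders), "addresses to block")
    for ip in sorted(offenders):
        print(ip)                                # e.g. pipe into ipset/iptables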
bjackman
10 hours ago
LWN includes archives of a bunch of mailing lists, so that might be a factor. There are a LOT of web pages on that domain.
phil21
8 hours ago
I'd guess some sort of middle-management local maximum. Someone set some metric of X pages per day scraped, or Y bits per month - whatever. CEO gets what he wants.
Then that got passed down to the engineers and those engineers got ridden until they turned the dial to 11. Some VP then gets to go to the quarterly review with a "we beat our data ingestion metrics by 15%!".
So any engineer that pushes back basically gets told too bad, do it anyways.
debo_
8 hours ago
Why is it in these invented HN scenarios that the engineers just happen to have absolutely no agency?
phil21
6 hours ago
Because I've personally seen it. Engineer says this is silly, it will blow up in the long run - told to implement it anyways. Not much to lose for the engineer to simply do it. Substitute engineer for any line level employee in any industry and it works just as well.
I've also run into these local-maximum stupidities dozens or more times in my career, where it was obvious someone was gaming a performance metric at the expense of the bigger picture - which required escalation to someone who could see said bigger picture to get fixed. It happens all the time as a customer, where some sales rep or sales manager wants to game short-term numbers at the expense of long-term relationships. At smaller companies you can usually get it fixed pretty quickly; larger companies tend to do more doubling down.
It usually starts with generally well-intentioned goal setting but devolves into someone optimizing a number on a spreadsheet without care (or perhaps knowledge) of the damage it can cause.
Hell, for the most extreme example look at Dieselgate. Those things don't start with some evil henchman at the top saying "let's cheat and game the metrics" - it often starts with someone unknowingly setting impossible-to-achieve goals in service of "setting the bar high for the organization", and by the time the backpressure filters up through the org it's oftentimes too late to fix the damage.
oblio
8 hours ago
Because: who would refuse more money?
fancyfredbot
8 hours ago
I don't think this evil boss and downtrodden engineer situation can explain what we're seeing.
Your theoretical engineers would figure out pretty quickly that crashing a server slows you down, and that the only way to keep the boss happy is to avoid the DDoS.
ks2048
9 hours ago
Perhaps incompetence instead of malice - a misconfigured or buggy scraper, etc.
mikkupikku
10 hours ago
NSA, trying to force everybody onto their Cloudflare reservation.
delfinom
8 hours ago
As someone who runs the infrastructure for a large OSS project: mostly Chinese AI firms. All the big name-brand AI firms play reasonably nicely and respect robots.txt.
The Chinese ones are hyper-aggressive, with no rate limiting and pure greed scraping. They'll scrape the same content hundreds of times in the same day.
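It shows up clearly in the logs; something like this is enough to surface the same URLs being fetched hundreds of times within a single day (the log path and combined-log field layout are assumptions):

    from collections import Counter

    # Count (user agent, URL) pairs in one day's log; the aggressive crawlers
    # stand out with hundreds of hits on the same page. Log path and the
    # combined-log field layout are assumptions.
    hits = Counter()
    with open("/var/log/nginx/access.log") as f:
        for line in f:
            parts = line.split('"')
            if len(parts) < 6:
                continue
            request, agent = parts[1], parts[5]   # "GET /path HTTP/1.1" and the UA
            fields = request.split()
            url = fields[1] if len(fields) > 1 else request
            hits[(agent, url)] += 1

    for (agent, url), n in hits.most_common(20):
        if n > 100:
            print(n, agent[:40], url)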
suburban_strike
6 hours ago
The Chinese are also sloppy. They will run those scrapers until they get banned and not give a fuck.
In my experience, they do not bother putting in the effort to obfuscate source or evade bans in the first place. They might try again later, but this particular setup was specifically engineered for resiliency.
bediger4000
6 hours ago
Is this an example of that "chabuduo" we read about now and then?
rfmoz
8 hours ago
Chinese AI firms have been making large numbers of requests in the past few weeks.
tjons
6 hours ago
how is this showing up for you? site you host or bigger scale? I'm not surprised but rather curious.
alephnerd
9 hours ago
> If I run a big data hungry AI lab consuming training data at 100Gb/s it's much much easier to scrape 10,000 sites at 10Mb/s than DDOS a smaller number of sites with more traffic
A little over a decade ago (f*ck I'm old now [0]), I had a similar conversation with an ML Researcher@Nvidia. Their response was "even if we are overtraining, it's a good problem to have because we can reduce our false negative rate".
Everyone continues to have an incentive to optimize for TP and FP at the expense of FN.
kylehotchkiss
10 hours ago
China (Alibaba and Tencent)
fancyfredbot
10 hours ago
I'm not at all sure Alibaba or Tencent would actually want to DDoS LWN or any other popular website.
They may face less reputational damage than say Google or OpenAI would but I expect LWN has Chinese readers who would look dimly on this sort of thing. Some of those readers probably work for Alibaba and Tencent.
I'm not necessarily saying they wouldn't do it if there was some incentive to do so but I don't see the upside for them.