Amazonbot is finally respecting robots.txt

163 points, posted 18 hours ago
by xena

46 Comments

phdelightful

16 hours ago

I just put Anubis in front of my self-hosted forge this morning because AmazonBot had helped itself to 750 GiB (!) of traffic to my public repos this month!

At least, it claimed to be AmazonBot…

faangguyindia

40 minutes ago

In my logs it appears like this:

BOT","cluster_name":"EU","cluster_region":"EU","connection_type":"corporate","country":"US","device_type":"ROBOT","duration_ms":0.391,"duration_us":391,"filter":"","ip":"52.1.106.130","isp":"Amazon.com, Inc.","level":"info","msg":"Request evaluated","org":"Amazon.com, Inc.","os":"","ref":"","region":"Virginia","result":false,"time":"2026-05-15T13:33:20Z","ua":"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36","why":"bot"}

It comes from all of these IPs as well:

3.227.180.70

23.21.175.228

23.23.137.202

Bender

15 hours ago

Are they in this space? [1] One could map the ranges into a web daemon and rate limit them, or just run 'ip route add blackhole ${cidr}' for each CIDR block.

[1] - https://ip-ranges.amazonaws.com/ip-ranges.json
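
For instance, a minimal sketch in Python, assuming the published schema of ip-ranges.json and a Linux box with iproute2 (note the AMAZON service tag covers far more than just the crawler):

    import json
    import subprocess
    import urllib.request

    URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

    with urllib.request.urlopen(URL) as resp:
        ranges = json.load(resp)

    for prefix in ranges["prefixes"]:
        if prefix["service"] == "AMAZON":
            # Blackhole the whole block; needs root, Linux only.
            subprocess.run(["ip", "route", "add", "blackhole", prefix["ip_prefix"]])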

rnhmjoj

8 hours ago

I just do this for the IP ranges of Amazon, OpenAI, Huawei and other companies that run these insane crawlers: it's 100% effective and it doesn't annoy real users with a captcha or some PoW thing. There's simply no reason for them to reach my homeserver other than to scrape the hell out of it.

phdelightful

2 hours ago

I didn't check thoroughly, but the first one I happened to grep out was not on that list:

"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36"

"x-forwarded-for":"44.210.204.255" "x-real-ip":"44.210.204.255"

This is a bit outside my area of expertise, so I don't know how reliable these x-forwarded-for and x-real-ip headers are.

Bender

2 hours ago

One of the places to look it up would be bgp.tools. [1] The IP is purported to belong to Amazon, and the ASN has some interesting tags. [2] Any form of forwarded-for header can be spoofed and should only be trusted when it comes from an expected upstream proxy such as a CDN; CDNs also set their own client-IP header, which will be listed in their documentation. Typically the first column in access logs is REMOTE_ADDR, the actual network connection, though if you are behind a CDN that will be the CDN's IP.
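
A rough sketch of that trust rule in Python (the proxy range here is hypothetical; substitute your CDN's published list):

    import ipaddress

    # Hypothetical CDN range; use your provider's published list.
    TRUSTED_PROXIES = [ipaddress.ip_network("203.0.113.0/24")]

    def client_ip(remote_addr, xff_header):
        peer = ipaddress.ip_address(remote_addr)
        if xff_header and any(peer in net for net in TRUSTED_PROXIES):
            # The right-most entry was appended by the proxy we trust.
            return xff_header.split(",")[-1].strip()
        # Otherwise the header is attacker-controlled input; ignore it.
        return remote_addr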

If a CDN does not have an option to block cloud and Tor CIDR blocks then that should be a feature request.

44.210.204.255 is included in 44.192.0.0/10 which is listed in the AWS CIDR ranges. Use one of the online subnet calculators to find IP ranges of CIDR blocks. This is likely a Tor exit node.

Blocking the CIDR blocks I listed in the thread would have included this node as well. Here [3] are a few shell functions for getting some of the cloud CIDR blocks. I must have been inebriated when I wrote those. This site may not be reachable during blood moons or when the nanosecond is divisible by zero.

Here [4a][4b] are a couple of decent subnet calculators. There are command-line tools for checking whether an IP is included in a CIDR block, but availability varies by Linux distribution, so perhaps look for a generic Python script, like the sketch below.
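
Something like this should work anywhere Python 3 is installed:

    import ipaddress

    ip = ipaddress.ip_address("44.210.204.255")
    net = ipaddress.ip_network("44.192.0.0/10")
    print(ip in net)             # True
    print(net[0], "-", net[-1])  # 44.192.0.0 - 44.255.255.255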

To get a list of Tor exit nodes to blackhole-route, look at [5]; it updates often. Just clone the entire repo. Unless your site is related to government dissent or anonymous porn, most traffic from Tor exit nodes will likely just be bots and thus riff-raff.

Seconds after I linked realhackers, bots showed up and got a zero-byte response. Poor lil HN servers must get a lot of trash non-stop. I hope I get some delicious bots today.

[1] - https://bgp.tools/

[2] - https://bgp.tools/as/14618

[3] - https://ai.realhackers.org/_get_cloud_cidr.txt

[4a] - https://mxtoolbox.com/subnetcalculator.aspx

[4b] - https://www.vultr.com/resources/subnet-calculator/

[5] - https://github.com/firehol/blocklist-ipsets/blob/master/clea...

Symbiote

7 hours ago

That's all of Amazon AWS, not just Amazon's AI system.

Bender

3 hours ago

Yup, mostly. There are more ranges for the Amazon store too.

It would be rather nifty if Amazon and other companies would confine AI traffic to specific CIDR blocks or a dedicated ASN, but I would not hold my breath on that one. AI crawlers will likely muddy the waters for everyone else.
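
You can see how coarse the published segmentation is by listing the service tags in ip-ranges.json; nothing there singles out crawler or AI traffic. A sketch, assuming the same schema as above:

    import json
    import urllib.request

    with urllib.request.urlopen("https://ip-ranges.amazonaws.com/ip-ranges.json") as r:
        data = json.load(r)

    # Tags like AMAZON, CLOUDFRONT, EC2, ROUTE53... no "crawler" or "AI".
    print(sorted({p["service"] for p in data["prefixes"]}))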

lofaszvanitt

7 hours ago

That list is a tad too long. Why isn't there a rule forcing these big corps to publicly state which range does what?

Bender

3 hours ago

That would indeed be handy, but I think the answer is that people would block specific ranges. By not segmenting into specific groups, people are forced to either:

- play the game of whack-a-mole

- use difficult implementations of user validation checks that potentially cause pain for real humans

- block all Amazon CIDR blocks, which they know most corporations will not do.

This forces the majority to just tolerate whatever comes out of their networks.

userbinator

11 hours ago

> At least, it claimed to be AmazonBot…

It's good that you mentioned this; smear campaigns are definitely not a new thing, and I suspect a lot of this DDoS'ing that's going on is a plot to accelerate towards Big Tech's authoritarian dystopia. Basically extortion.

faangguyindia

9 hours ago

I see bots with the ClaudeBot user agent using AWS IPs.

I've also seen Google bots coming from AWS IP ranges. You gotta look at the ASN/ISP/ORG, as in the sketch below.
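
A quick way to do that lookup from a script, assuming the third-party ipwhois package (pip install ipwhois):

    from ipwhois import IPWhois

    result = IPWhois("52.1.106.130").lookup_rdap()
    # Expect an Amazon ASN/description for the IP from the log above.
    print(result["asn"], result["asn_description"])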

nathanmills

15 hours ago

Do you have a robots.txt?

xena

15 hours ago

> We are writing to inform you that starting Monday, June 15, 2026, crawl preferences for Amazonbot will be managed solely through the industry-standard directives.

They will in the future, but not today.

jacobn

17 hours ago

I just complained to them the other day! They were scraping our weather website to no end, very much including the disallowed path prefixes.

Did end up just adding them to our WAF blocklist, which is weirdly ironic - hosting on their infra & using their services to block their AI scraper...

BLKNSLVR

16 hours ago

I hope you leave it on the WAF. If they're only just deciding to respect robots.txt, which has been internet infrastructure forever, then it's probably still incredibly amateur software with 'Amazon-priorities' rather than 'responsible internet traffic' priorities.

tardedmeme

12 hours ago

The responsible internet is dead. Every big actor on the internet is selfish now that there's money involved. And has been for 20 years.

Google only respected it because blocking Google from crawling your site used to hurt you more than it hurt Google.

BLKNSLVR

10 hours ago

Time to switch to allow lists instead of block lists...

adrianvi

15 hours ago

step 1: create the problem, step 2: sell the solution, step 3: profit

bstsb

17 hours ago

> Get Outlook for Mac

this bit made me laugh. was the email drafted in Outlook? was it sent to some sort of forwarding mailbox, or did they just BCC every customer in?

jdiff

12 hours ago

> Looking at the email headers it has a bunch of Exchange-specific headers so it's probably actually from Outlook for Mac.

My guess would be some sort of internal forwarding mailing list, yeah.

captn3m0

16 hours ago

Good place to ask: I saw a new AWS user agent in my logs today: Amazon-Quick-on-Behalf-of-$HEXID

I found a mention on some user-agent trackers but no official documentation. Does anyone know if it's documented? Asking because I'm seeing decent traffic (30 GB/week) from it.

embedding-shape

15 hours ago

Came across this recently too; it seems to be from "Amazon Quick", where crawling others' websites is basically a feature of the product: https://docs.aws.amazon.com/quick/latest/userguide/web-crawl...

> Crawling behavior [...] Crawler identification: Identifies itself with user-agent string "aws-quick-on-behalf-of-<UUID>" in request headers.

Maybe people have found a way of using it as a loophole for something, or Amazon Quick is just picking up in usage and your website is popular amongst whoever uses that sort of stuff.
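
If you want to count these hits in your own logs, a rough regex that matches both the form seen above and the one in the AWS docs might look like this (the UUID/hex part varies):

    import re

    QUICK_UA = re.compile(r"(amazon-quick|aws-quick)-on-behalf-of-[0-9a-f-]+",
                          re.IGNORECASE)

    ua = "Amazon-Quick-on-Behalf-of-1a2b3c4d"  # hypothetical log value
    print(bool(QUICK_UA.search(ua)))           # True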

iLoveOncall

16 hours ago

Amazon Quick is the new name of Quicksight, which is the BI tool from AWS.

It has AI agents included so I guess this can just come from it searching the web based on user requests.

TurdF3rguson

16 hours ago

Why does Amazonbot even exist, can someone explain? I don't understand why an ecommerce play would be crawling other websites.

input_sh

16 hours ago

To train AI. That's not even hyperbole; it's the only concrete example they list in their explanation: https://developer.amazon.com/amazonbot

> Amazonbot is used to improve our products and services. This helps us provide more accurate information to customers and may be used to train Amazon AI models.

tardedmeme

12 hours ago

It would be more fun to respond with false data.

tintor

16 hours ago

To ensure Amazon marketplace sellers aren't offering lower prices on other ecommerce websites. Also AI.

b112

16 hours ago

I was wondering about this, and it makes me think this is all a mistruth, unless they plan to drop that pricing tactic.

They've been getting some heat on it lately, but I find it hard to believe they're going to give up entirely? And if so, what's to stop someone from just flouting their rules on pricing, and then doing the robots.txt thing to prevent issues?

embedding-shape

16 hours ago

Amazonbot is specifically the user agent they use when crawling to "provide more accurate information to customers" (whatever that means; it sounds like it could be anything) and also when they scrape data for AI training, according to https://developer.amazon.com/amazonbot

TrackerFF

14 hours ago

Is it just me, or is it extra unethical and self-serving when crawlers from, say, Amazon(Bot) decide to incessantly crawl AWS-hosted websites? Same goes for Google and Microsoft crawlers crawling Google Cloud and Azure.

By that, I mean the types of crawls that can hog up significant usage.

arjie

17 hours ago

Huh, I get a lot of traffic from Amazonbot (relative to humans), and try as I might, it would get stuck in an accidental tarpit: it would sit there and keep blasting every variation of my recent pages, because MediaWiki generates so many internal links. I have those links appropriately marked nofollow, and robots.txt warns the bot not to waste its time (roughly as sketched below), but it just goes and sticks itself on nonsense internal pages.
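
The relevant robots.txt rules look roughly like this (the paths are illustrative rather than my actual config; MediaWiki's dynamic pages go through index.php, and Disallow is a prefix match):

    User-agent: Amazonbot
    Disallow: /index.php
    Disallow: /wiki/Special: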

The traffic isn't a problem. I've got Cloudflare in front and the machine itself is relatively overpowered, and downtime isn't critical. But I'd just like the thing to be able to spider me properly. Someone did point out to me that maybe I wasn't receiving actual Amazonbot but some other spider: https://news.ycombinator.com/item?id=46352723

namegulf

17 hours ago

Robots.txt is lame BTW; there is no way to enforce it. It is up to the bot to decide whether to crawl, and in most cases they don't care.

Cloudflare has a nice technique to address the bot problem (if you use their name servers): it will honor your robots.txt while sending the remaining bots into a deep black hole.

input_sh

16 hours ago

Yes, we know, its purpose is to guide the bots, not forcibly block them.

That said, one of the biggest websites in the world not respecting it is definitely a noteworthy story. Hopefully another one of the biggest websites in the world (formerly known as Twitter) eventually respects it as well instead of not even disclosing itself via a user agent and pretending to be Safari running on iOS.

namegulf

16 hours ago

Why downvote a comment?

You're talking about one bot (yes, the biggest), but the millions of other bots that don't follow it must be a bigger story.

marginalia_nu

15 hours ago

Robots.txt is great if you're trying to run an above board operation. Much easier than trying to guess how a webmaster wishes the crawler to behave, and then getting angry emails when you guess wrong.

tardedmeme

12 hours ago

It's not great. It used to be very common for a robots.txt to disallow everything and allow only Googlebot (see the sketch below), which just entrenches the search engine monopoly. In response, other search engines just used the rules for Googlebot instead of the rules for their own crawlers.
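
The pattern in question, spelled out (an empty Disallow means "allow everything"):

    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /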

marginalia_nu

6 hours ago

Eh, not really my experience running an internet search engine and a crawler. It happens occasionally, but mostly people seem to focus on what they perceive as nuisance crawlers if they do disallow any specific UAs.

llbbdd

16 hours ago

Yeah, robots.txt is a textbook example of the type of solution invented by people who don't understand incentives whatsoever.

Ferret7446

12 hours ago

robots.txt is a textbook example of people misunderstanding and misusing a tool. The file was designed to help crawlers, by pointing them to the content most worth indexing and helping them avoid wasting resources on useless pages.

The people trying to use it to block or limit bots are uninformed and/or misinformed.

rho138

12 hours ago

If it respected the standard, then the lack of a robots.txt would imply do-not-crawl, which they openly state they ignore.

faangguyindia

9 hours ago

If you run Meta Ads, it's notorious for DDoSing your website with bots. Basically, their ad manager sends dozens of clicks for each variant of ad you post.

vindin

15 hours ago

robots.txt is merely a gentleman’s courtesy at this point. Nobody is obligated to follow it.