plastic041
a day ago
> Avoid detection with built-in anti-bot patches and proxy configuration for reliable web scraping.
And it doesn't care about robots.txt.
andrew_zhong
a day ago
Good point. The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated — things like CDP leak fixes so Cloudflare doesn't block you mid-session. It's not about bypassing access restrictions.
Our main use case is retail price monitoring — comparing publicly listed product prices across e-commerce sites, which is pretty standard in the industry. But fair point, we should make that clearer in the README.
plastic041
a day ago
robots.txt is the most basic access restriction, and it doesn't even read it while faking itself as human[0]. It is about bypassing access restrictions.
[0]: https://github.com/lightfeed/extractor/blob/d11060269e65459e...
zendist
a day ago
Regardless. You should still respect robots.txt.
andrew_zhong
a day ago
We do respect robots.txt in production - scraping browser providers like BrightData also enforce it.
I will add a PR to enforce robots.txt before the actual scraping.
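A minimal sketch of what such a pre-scrape check could look like, using only Python's standard library `urllib.robotparser` (the `"lightfeed-bot"` User-Agent and example rules below are placeholders, not anything from the actual repo):

```python
from urllib import robotparser

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Example rules as a site might publish at /robots.txt
RULES = "User-agent: *\nDisallow: /private/\n"
```

In production you would fetch the live file instead (`rp.set_url(...)` followed by `rp.read()`) and skip any URL for which `allowed` returns False before the browser ever navigates to it.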
plastic041
a day ago
How can people believe that you are respecting robots.txt in production when your software's README says it can "Avoid detection with built-in anti-bot patches"?
andrew_zhong
16 hours ago
I hear you loud and clear - will replace the stealth browser with plain playwright and remove anti-bot as a feature.
messe
a day ago
> It's not about bypassing access restrictions.
Yes. It is. You've just made an arbitrary choice not to define it as such.
andrew_zhong
a day ago
I will add a PR to enforce robots.txt before the actual scraping.
messe
11 hours ago
Or just follow web standards and define and publish your User-Agent header, so that people can block that as needed.
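For illustration, declaring an identifiable User-Agent on a plain stdlib request could look like this (the bot name and info URL are hypothetical placeholders; a real crawler would publish its actual identifier so site operators can allow or block it):

```python
from urllib.request import Request

# Hypothetical crawler identity: name/version plus a URL describing the bot
UA = "lightfeed-bot/1.0 (+https://example.com/bot-info)"
req = Request("https://example.com/products/1", headers={"User-Agent": UA})
```

The same string would be passed wherever the scraper configures its HTTP client or browser context, so every request carries a stable, blockable identity.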
You're creating the wrong kind of value. I really hope your company fails, as its success implies a failure of the web in general.
I wish you the best success outside of your current endeavour.