-
How are you blocking them? Do you have any kind of rate limiting?
Because these guys appear to be ignoring robots.txt and crawling stuff anyway, using IPs they don't publish.
"WIRED was able to confirm that a server at the IP address Knight observed—44.221.181.252—will, on demand, visit and download webpages when a user asks Perplexity about the webpage, regardless of what the site’s robots.txt says."
https://www.wired.com/story/perplexity-is-a-bullshit-machine/
-
- User agent blocks
- ASN blocks
- IP blocks
- Rate limits
- HTTP header + TLS cipher blocks
I also monitor daily, and if I see anything evading the blocks, I block it.
But anything successfully scraped before these blocks went in will be out there for a while, until it's considered stale.
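For anyone wanting to see what the first few layers above (user agent, IP, rate limits) can look like, here's a minimal illustrative sketch in Python, standard library only. All the denylist values and limits are made up for the example, and the /24 is just the range around the IP WIRED reported. ASN and TLS-cipher blocks happen further down the stack (firewall, proxy, or CDN) and aren't shown.

# Illustrative sketch only: a tiny WSGI middleware combining user-agent,
# IP/CIDR, and per-IP rate-limit checks. Denylists and limits are
# invented for the example; real deployments usually do this at the
# proxy or CDN edge rather than in the app.
import ipaddress
import time

BLOCKED_UA_SUBSTRINGS = ("GPTBot", "CCBot", "PerplexityBot")  # example values
BLOCKED_NETWORKS = [ipaddress.ip_network("44.221.181.0/24")]  # example range
RATE_LIMIT = 30       # example: max requests per window per IP
WINDOW_SECONDS = 60

_hits: dict[str, list[float]] = {}  # in-memory state, fine for a sketch

def is_blocked(remote_addr: str, user_agent: str) -> bool:
    """Return True if the request should be refused."""
    if any(s.lower() in user_agent.lower() for s in BLOCKED_UA_SUBSTRINGS):
        return True
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return True  # malformed address: refuse
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    # crude sliding-window rate limit
    now = time.monotonic()
    window = [t for t in _hits.get(remote_addr, []) if now - t < WINDOW_SECONDS]
    window.append(now)
    _hits[remote_addr] = window
    return len(window) > RATE_LIMIT

class BlockerMiddleware:
    """Wraps any WSGI app and 403s blocked requests before the app sees them."""
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        ip = environ.get("REMOTE_ADDR", "")
        ua = environ.get("HTTP_USER_AGENT", "")
        if is_blocked(ip, ua):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return self.app(environ, start_response)

In practice most of this runs at the proxy or CDN edge rather than in the app, but the logic is the same.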
I return HTTP 403 to the majority of things... but I did redirect some stuff to large random files, and a couple of weeks ago I accidentally served 2.3 PB in 6 hours to the Facebook scraper.
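One way to implement that 403-vs-decoy split, again as an illustrative sketch (the decoy path and the fraction are invented for the example):

# Most blocked traffic gets a plain 403; a chosen subset gets redirected
# to a large random file instead. A scraper that keeps following the
# redirect is how you end up serving petabytes.
import random

DECOY_URL = "/static/decoy-8GiB.bin"  # hypothetical path to a big random file

def response_for_blocked(environ, start_response):
    if random.random() < 0.05:  # example: ~5% of blocked hits get the decoy
        start_response("302 Found", [("Location", DECOY_URL)])
        return [b""]
    start_response("403 Forbidden", [("Content-Type", "text/plain")])
    return [b"Forbidden\n"]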
The blocks are effective.
Attached you can see a scraper that hit us at 20:00 UTC yesterday and was effectively blocked.
- User agent blocks
All bots are blocked, all scraping is blocked.
It wasn't originally, but there's a thread somewhere, and I slammed the gates shut.