My process is that if you go around robots.txt, you're hostile, and you get routed to null on the next access. If you attempt to directly access cached URLs, you're hostile, same answer. The file of IPv4 and IPv6 addresses that have attempted this is easily a half-mile long.
Happy to add archive.org to it. Baidu, Bing, and yes, Google, are already there. Most of the entries have come from AWS instances snooping around. They get the same answer.
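
For anyone wanting to try something similar, here is a minimal sketch of that kind of policy in Python. It is not the actual setup described above. The robots.txt rules, the `check_request` and `blackhole_command` helpers, and the in-memory `blocked` set are all hypothetical stand-ins, and the sketch only builds the `ip route add blackhole` command as a string rather than running it.

```python
import ipaddress
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; substitute the real file's contents.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cache/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# Stand-in for the long file of offending IPv4 and IPv6 addresses.
blocked = set()


def check_request(ip, path):
    """Classify one request as 'serve', 'flagged', or 'dropped'."""
    addr = ipaddress.ip_address(ip)
    if addr in blocked:
        # Already marked hostile: every later access is dropped.
        return "dropped"
    if not rules.can_fetch("*", path):
        # Asked for a path robots.txt disallows: mark the address hostile.
        blocked.add(addr)
        return "flagged"
    return "serve"


def blackhole_command(addr):
    """Build (but do not run) a Linux iproute2 null-route command."""
    family = "-4" if addr.version == 4 else "-6"
    prefix = 32 if addr.version == 4 else 128
    return f"ip {family} route add blackhole {addr}/{prefix}"


if __name__ == "__main__":
    print(check_request("203.0.113.5", "/index.html"))
    print(check_request("203.0.113.5", "/private/report"))
    print(check_request("203.0.113.5", "/index.html"))
    for addr in blocked:
        print(blackhole_command(addr))
```

In practice the flagged addresses would more likely come from web server logs than from an in-process check, and the resulting commands would need root on a Linux box to take effect.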