> LLMs doing crawling? That might be ill-behaved, but not an "attack".
Some of us will think of it as an attack when the bots ignore robots.txt (or honor changes very slowly), intentionally masquerade as something they are not, and hit the same site from tons of different addresses, especially when it is continuous. I discovered this myself on a small internet-connected club server late last year. The MediaWiki site was becoming unresponsive and throwing errors. On investigation, we were getting dozens of HTTP requests per second from Amazon and ByteDance, every one of them from a different IP address. Only our main page was allowed in robots.txt, so that SOMETHING would end up on search sites, but the bots didn't care. I changed robots.txt to disallow everything on the site, instead of allowing just the main page, but that file apparently isn't re-checked very often. It took me hours of manually banning over a thousand IP addresses before the server could reasonably respond to web requests again.
Example hit:
47.128.50.93 - - [22/Sep/2024:15:03:05 -0400] "GET /mediawiki/index.php?days=30&from=20240920012115&title=Special%3ARecentChanges HTTP/1.1" 200 10111 "-" "Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)"
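For anyone wanting to check their own logs for the same pattern, here is a rough sketch of how the "one crawler, many addresses" picture could be tallied. It assumes the Apache combined log format shown in the hit above; the regular expression and the function name `distinct_ips_by_agent` are my own illustration, not something the server ran at the time.

```python
import re
from collections import defaultdict

# Assumed layout: Apache "combined" log format, as in the example hit.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def distinct_ips_by_agent(lines):
    """Map each User-Agent string to the set of client IPs that sent it."""
    ips = defaultdict(set)
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m is None:
            continue  # skip lines that are not in the assumed format
        ips[m.group("agent")].add(m.group("ip"))
    return ips

# The example hit quoted above, verbatim.
SAMPLE = (
    '47.128.50.93 - - [22/Sep/2024:15:03:05 -0400] '
    '"GET /mediawiki/index.php?days=30&from=20240920012115'
    '&title=Special%3ARecentChanges HTTP/1.1" 200 10111 "-" '
    '"Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Mobile Safari/537.36 '
    '(compatible; Bytespider; spider-feedback@bytedance.com)"'
)

if __name__ == "__main__":
    # Print agents ordered by how many distinct addresses used them.
    counts = distinct_ips_by_agent([SAMPLE])
    for agent, addrs in sorted(counts.items(), key=lambda kv: -len(kv[1])):
        print(len(addrs), agent)
```

Fed a whole access log instead of one line, an agent string that shows up with hundreds of distinct addresses is the pattern described above.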
Eventually, the overwhelming majority of the "robots" (web scrapers) did honor the robots.txt disallowing everything; some just did so very lazily and kept trying to hammer us for days before stopping. I still haven't removed the IP blocks or put robots.txt back to allowing just the main page.
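For reference, a disallow-everything robots.txt is only two lines, and a well-behaved crawler is expected to check each URL against it before fetching. The sketch below uses Python's standard `urllib.robotparser` to show what that check looks like; the host `example.org` and the helper name `is_allowed` are placeholders of mine, and I am not claiming any particular crawler is implemented this way.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that disallows the whole site for every user agent.
DISALLOW_ALL = "User-agent: *\nDisallow: /\n"

def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    # example.org is a placeholder host, not the real server.
    url = "https://example.org/mediawiki/index.php?title=Special%3ARecentChanges"
    print(is_allowed(DISALLOW_ALL, "Bytespider", url))
```

The catch, as described above, is that this only helps if the crawler re-fetches robots.txt reasonably often; a cached copy of the old rules keeps the old behavior going.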