Before LLMs, web crawling operated on broadly accepted
conventions. Search engine indexers usually respected
robots.txt published by site owners. With AI,
this changed. Operators now struggle with significant
disruptions as some crawlers scale their requests to levels
that mimic DDoS attacks. Wikipedia’s media-download bandwidth
has grown by 50% because of automated scrapers collecting
content for AI models (Source:
https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects/).
Additionally, bots were responsible for 65% of its most
expensive traffic despite being “only” about 35% of total
pageviews, because they scraped uncached pages that are
costlier to serve. Last year, LWN was dealing with AI scraper
bots that spread their requests across literally millions of
IP addresses, circumventing simple rate limits.
Possible defenses can be split into two broad categories. The first is friction: the client proves it is a browser by executing JavaScript that solves a “puzzle” or, in other words, it invests compute. Anubis is the most popular project implementing this approach. It is a web AI firewall utility whose challenges include a SHA-256 proof-of-work puzzle (if you hear proof-of-work and think of Bitcoin, you’re right: Anubis was inspired by Hashcash, which also inspired Bitcoin). The tradeoff is that it is practically a nuclear response, and it may also inconvenience good bots such as the Internet Archive.
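To make the friction idea concrete, here is a minimal sketch of a Hashcash-style SHA-256 proof-of-work exchange. It is not Anubis’s actual challenge format or code (in Anubis the solving happens in browser JavaScript); the difficulty value and encoding below are made up for illustration.

```python
import hashlib
import secrets

# Illustrative difficulty: real deployments tune this to taste.
DIFFICULTY_BITS = 18  # hash must start with this many zero bits

def issue_challenge() -> str:
    """Server side: hand the client a random challenge string."""
    return secrets.token_hex(16)

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()  # zeros in the first non-zero byte
        break
    return bits

def solve(challenge: str) -> int:
    """Client side: grind nonces until the hash clears the difficulty.
    This is the part that costs the visitor (or the scraper) compute."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash, so checking the answer stays cheap."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

challenge = issue_challenge()
nonce = solve(challenge)          # expensive for the client
assert verify(challenge, nonce)   # cheap for the server
```

The asymmetry is the whole point: one request costs a fraction of a second, but a scraper hammering millions of pages pays that cost millions of times.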
The second category is deception. Instead of saying no to a bot accessing your site, you feed the scraper poison. Cloudflare’s AI Labyrinth is an example of that idea. When Cloudflare suspects unauthorized AI crawling, it can serve hidden links to AI-generated decoy pages that look plausible enough to traverse but are irrelevant to the protected site. Those decoy pages are invisible to human visitors. There is also a self-hosted take on the idea called Quixotic. It’s a content obfuscator that rewrites a copy of your site with Markov-generated substitutions and also scrambles a portion of the images on the site. Nepenthes goes even further and calls itself a tarpit for web crawlers, specifically targeting LLM scrapers. Its pages are generated in an endless deterministic sequence, packed with links that lead back into the tarpit and padded with Markov-generated “babble”. Both self-hostable approaches can make a site disappear from search results entirely.
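For a feel of how such a tarpit might hang together, here is a toy sketch: each URL deterministically seeds a tiny Markov-style babble generator and emits links that only lead deeper into the maze. This is not how Nepenthes or Quixotic are actually implemented; the corpus, URL scheme, and page layout are invented for the example.

```python
import hashlib
import random

# Tiny stand-in corpus; a real generator would train on much more text.
CORPUS = "the quick brown fox jumps over the lazy dog and runs far away".split()

def babble(rng: random.Random, words: int = 80) -> str:
    """Markov-style babble: pick each next word from the words that
    follow the current one in the corpus."""
    chain = {}
    for a, b in zip(CORPUS, CORPUS[1:]):
        chain.setdefault(a, []).append(b)
    word = rng.choice(CORPUS)
    out = [word]
    for _ in range(words - 1):
        word = rng.choice(chain.get(word, CORPUS))
        out.append(word)
    return " ".join(out)

def tarpit_page(path: str, links: int = 10) -> str:
    """Deterministic: the same path always produces the same page, and
    every link points at another page inside the tarpit."""
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    body = babble(rng)
    anchors = "\n".join(
        f'<a href="/tarpit/{rng.getrandbits(64):016x}">more</a>'
        for _ in range(links)
    )
    return f"<html><body><p>{body}</p>\n{anchors}\n</body></html>"

print(tarpit_page("/tarpit/start"))
```

A crawler that follows links blindly never runs out of pages here, which is exactly why this approach is so dangerous for legitimate indexing if a search engine wanders in.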
I do not like this direction. I would much rather live on a
web where robots.txt and sane rate limits were
enough, but here we are. For sites like Wikipedia it’s
especially annoying, as they literally provide
data dumps that could be used instead.