Before LLMs, web crawling operated on broadly accepted
conventions. Search engine indexers usually respected
robots.txt published by site owners. With AI,
this changed. Operators now struggle with significant
disruptions as some crawlers scale their requests to levels
that mimic DDoS attacks. Wikipedia’s media-download bandwidth
has grown by 50% because of automated scrapers collecting
content for AI models (Source:
https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects/).
Additionally, bots were responsible for 65% of its most
expensive traffic despite being “only” about 35% of total
pageviews, because they scraped uncached pages that are
costlier to serve. Last year, LWN was dealing with AI scraper
bots that spread their requests across literally millions of
IP addresses, circumventing simple rate limits.
Possible defenses can be split into two broad categories. The first is friction: the client proves it is a browser by executing JavaScript that solves a “puzzle” or, in other words, it invests compute. Anubis is the most popular project implementing this approach. It is a web AI firewall utility whose challenges include a SHA-256 proof-of-work puzzle (if you hear proof-of-work and think of Bitcoin, you’re right: Anubis was inspired by Hashcash, which also inspired Bitcoin). The tradeoff is that it is practically a nuclear response, and it may also inconvenience good bots such as the Internet Archive.
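To make the friction idea concrete, here is a minimal sketch of a Hashcash-style SHA-256 proof-of-work exchange. It is not Anubis’s actual challenge format or code (in Anubis the solving happens in browser JavaScript); the difficulty value and encoding below are made up for illustration.

```python
import hashlib
import secrets

# Illustrative difficulty: real deployments tune this to taste.
DIFFICULTY_BITS = 18  # hash must start with this many zero bits

def issue_challenge() -> str:
    """Server side: hand the client a random challenge string."""
    return secrets.token_hex(16)

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()  # zeros in the first non-zero byte
        break
    return bits

def solve(challenge: str) -> int:
    """Client side: grind nonces until the hash clears the difficulty.
    This is the part that costs the visitor (or the scraper) compute."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash, so checking the answer stays cheap."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

challenge = issue_challenge()
nonce = solve(challenge)          # expensive for the client
assert verify(challenge, nonce)   # cheap for the server
```

The asymmetry is the whole point: one request costs a fraction of a second, but a scraper hammering millions of pages pays that cost millions of times.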
The second category is deception. Instead of saying no to a bot accessing your site, you feed the scraper poison. Cloudflare’s AI Labyrinth is an example of that idea. When Cloudflare suspects unauthorized AI crawling, it can serve hidden links to AI-generated decoy pages that look plausible enough to traverse but are irrelevant to the protected site. Those decoy pages are invisible to human visitors. There is also a self-hosted take on the idea called Quixotic. It’s a content obfuscator that rewrites a copy of your site with Markov-generated substitutions and also scrambles a portion of the images on the site. Nepenthes goes even further and calls itself a tarpit for web crawlers, specifically targeting LLM scrapers. Its pages are generated in an endless deterministic sequence, packed with links that lead back into the tarpit and padded with Markov-generated “babble”. Both self-hostable approaches can make a site disappear from search results entirely.
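For a feel of how such a tarpit might hang together, here is a toy sketch: each URL deterministically seeds a tiny Markov-style babble generator and emits links that only lead deeper into the maze. This is not how Nepenthes or Quixotic are actually implemented; the corpus, URL scheme, and page layout are invented for the example.

```python
import hashlib
import random

# Tiny stand-in corpus; a real generator would train on much more text.
CORPUS = "the quick brown fox jumps over the lazy dog and runs far away".split()

def babble(rng: random.Random, words: int = 80) -> str:
    """Markov-style babble: pick each next word from the words that
    follow the current one in the corpus."""
    chain = {}
    for a, b in zip(CORPUS, CORPUS[1:]):
        chain.setdefault(a, []).append(b)
    word = rng.choice(CORPUS)
    out = [word]
    for _ in range(words - 1):
        word = rng.choice(chain.get(word, CORPUS))
        out.append(word)
    return " ".join(out)

def tarpit_page(path: str, links: int = 10) -> str:
    """Deterministic: the same path always produces the same page, and
    every link points at another page inside the tarpit."""
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    body = babble(rng)
    anchors = "\n".join(
        f'<a href="/tarpit/{rng.getrandbits(64):016x}">more</a>'
        for _ in range(links)
    )
    return f"<html><body><p>{body}</p>\n{anchors}\n</body></html>"

print(tarpit_page("/tarpit/start"))
```

A crawler that follows links blindly never runs out of pages here, which is exactly why this approach is so dangerous for legitimate indexing if a search engine wanders in.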
I do not like this direction. I would much rather live on a
web where robots.txt and sane rate limits were
enough, but here we are. For sites like Wikipedia it’s
especially annoying, as they literally provide
data dumps that could be used instead.