My Forgejo instance was attacked by LLM crawlers

April 3, 2025 :: 5 min read

Background

This post is a reaction to the attack my server suffered on the weekend of March 29th-30th. It happened even though my robots.txt file has the following value on all my subdomains:

User-agent: *
Disallow: /

If you’re not familiar with the robots.txt file, the general idea behind it is to tell bots which resources they are allowed to access, and which bots are allowed to access them. In my case, I don’t want any bot to be able to see anything on my websites.

It is worth noting that the robots.txt file works in a similar way to the DNT (Do-Not-Track) HTTP header, meaning that it is up to the service provider to decide whether to respect the user’s choice. So what should usually happen is that a bot makes a single, initial request to /robots.txt, sees that I don’t want it to crawl my server, and gives up instead of making any further requests.

However, it is now well-established that LLM companies do not care about consent at all. In fact, most LLM crawlers simply ignore the content of robots.txt. Many developers are currently struggling with attacks from LLM crawlers, up to the point where some FOSS service providers even had to temporarily block entire countries to deal with it [1].

What happened on my server

What happened is that I decided to take a break this weekend. I wasn’t able to do much work on my server remotely, because I didn’t have access to my physical computer. Of course, this had to be the weekend when LLM crawlers discovered my Forgejo instance. Because it is a private, single-user instance, I wasn’t really prepared for such a situation.

The attack started on Saturday afternoon, with several requests from different IP addresses to the /robots.txt path. Some bots stopped making requests after reading this file, but most of them continued anyway. Some bots even started making requests without asking for the robots.txt file at all.

During the whole weekend, each one of these bots made multiple requests per second to expensive endpoints of my Forgejo instance, such as diffs between commits, code archives for every commit, etc. It amounted to around 30 times the regular traffic on my server. Requests were made to every possible endpoint on the server, and it never stopped for one second. It went on and on, requesting endpoints that had already been requested several times before, as if an infinite loop had been set up to fetch every possible URL on the site.

Not only did this eat a lot of resources, it also filled my server’s storage with unnecessary data, because Forgejo had to generate code archives for every commit in every branch of every repository. While the disk usage of my instance is usually less than 100 MB, it suddenly jumped to several GB.

How I stopped the attack

Fortunately for me, my server was still responsive, so I was able to ssh in and find out what was going on. Unfortunately, I was only able to do that on Sunday night. Once logged in, I immediately understood that I wasn’t facing a DDoS attack, but rather a DoS attack from LLM crawlers.

After taking a peek at the logs, I started blocking several IP addresses manually with my firewall to slow the attack down. It worked, but I knew this was only a temporary solution, because there are plenty of other bots out there. I was also pretty sure that these crawlers would change their IP addresses and attack my server again sooner or later.
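For reference, blocking a single address by hand looks roughly like this with iptables (the address below is a placeholder from the documentation range, and the equivalent nftables rule would do just as well):

# Drop all traffic from one offending crawler (placeholder IP)
iptables -I INPUT -s 203.0.113.42 -j DROP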

So, I started tweaking my nginx configuration. What I wanted was to prevent LLM crawlers from accessing any endpoint of my Forgejo instance. The configuration checks the user-agent of incoming HTTP requests against an aggressive regex that matches most known LLM bot user-agents. If the user-agent matches the regex, nginx returns a 403 error instead of the regular page. This first measure ensures that bots can’t reach the expensive endpoints.
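As an illustration, a minimal version of that kind of check looks like this inside the server block of the Forgejo vhost (the bot names in the regex are just a few common examples, not my actual list):

# Reject requests whose user-agent matches a known LLM crawler (case-insensitive)
if ($http_user_agent ~* "(GPTBot|ClaudeBot|CCBot|Bytespider|Amazonbot|PerplexityBot)") {
    return 403;
}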

Next, I wrote a fail2ban filter with the exact same regex. This made it possible to instantly ban hosts whose requests contain a user-agent known to belong to an LLM bot. It works very well: it has blocked over 30 bots in the last few days.
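A stripped-down sketch of such a filter and its jail could look like this, assuming nginx’s default combined access log format (the file names, bot list and ban settings here are illustrative, not my exact configuration):

# /etc/fail2ban/filter.d/llm-bots.conf
[Definition]
failregex = ^<HOST> .+ ".*(GPTBot|ClaudeBot|CCBot|Bytespider|Amazonbot|PerplexityBot).*"$

# /etc/fail2ban/jail.d/llm-bots.conf
[llm-bots]
enabled  = true
port     = http,https
filter   = llm-bots
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400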

The downsides of my restrictions

Even though my restrictions seem to work for now, the approach has a few downsides. First of all, some LLM bots are known to spoof their user-agent to make it look like they’re human visitors. Also, since the regex I wrote is pretty aggressive, it may block some “legitimate” bots, such as search engine indexers or social media scrapers.

If these issues end up being problematic, I will probably set up something like Anubis [2] to protect my server.

In conclusion

If I learned one thing from this attack, it is that I was completely right about LLMs from the start. I’ve never liked them, and I’ve avoided using them since the very beginning. This is also a reminder that the software I maintain, TBlock, has a policy of declining any contribution suspected to have been made with LLMs.

To hell with their so-called “AI”.

Footnotes

[1]: FOSS infrastructure is under attack by AI companies

[2]: Block AI scrapers with Anubis
