FOSS infrastructure is under attack by AI companies
https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/
Is the era of anonymous browsing coming to an end?
Mar 20 · 4 months ago · 👍 LucasMW · 🙁 4
13 Comments
Didn't know about this Anubis thing. Seems useful. This whole scraping bot stuff is getting completely out of control
how horrible... I think we have been seeing a little bit of this on Gemini, with an increase in aggressive bots crawling capsules via proxies. I've had to block a fair few IP addresses. I just hope we can lie low enough that they don't start crawling geminispace directly. Not sure what we could do if they did -- even securing everything behind client certificate checks would only be a stop-gap measure, and I suppose we have no way of incorporating proof-of-work that wouldn't be a major pain for users.
I'm just waiting for the Terminator, as the messiah, who will restore the balance of things. The one problem I currently see is that he is also learning from human mistakes, so the balance may be flawed until Messiah 2.0 arrives.
At this point, we should just re-implement Coinhive. It would help balance out the costs of serving things on the internet. If scrapers are racking up 70 TB of usage on a webserver, they'd better pay for it with proof of work....
@LucasMW Today, proof of work is not an option when you already have enough money and slaves. I see invitation-only access as the only solution for the humans you can still be assured are not humanoid yet; and make sure it won't be called racism, bot-phobia, or any other legal term very soon.
@ps The idea of proof of work is mostly to balance out the costs of such an attack. So even if hit with several TB of bandwidth, the maintainer can still pay for the server. Optimally, the balance of proof of work should tilt towards the maintainer, so that even if attacked, someone would still be able to make a buck.
This would still be a pain for legit users, who may be forced to authenticate and pay for it in order to avoid such blocks.
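To make the idea concrete, here is a rough hashcash-style sketch of what a proof-of-work challenge can look like. The 20-bit difficulty, the challenge format, and the function names are all made up for illustration; this is not how Anubis or Coinhive actually work, just the general shape of the technique.

```python
# hashcash-style proof of work: a rough sketch, not any real product's scheme
# (the 20-bit difficulty and the challenge format here are arbitrary choices)
import hashlib, os, itertools

DIFFICULTY_BITS = 20  # each extra bit roughly doubles the client's work

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def make_challenge() -> str:
    # server side: hand this to the client along with the difficulty
    return os.urandom(16).hex()

def solve(challenge: str) -> int:
    # client side: burn CPU until the hash has enough leading zero bits
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce

def verify(challenge: str, nonce: int) -> bool:
    # server side: one hash to check, about a million for the client to find
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

if __name__ == "__main__":
    c = make_challenge()
    n = solve(c)          # ~2^20 hashes on average at 20 bits
    print(verify(c, n))   # True
```

The asymmetry is the whole point: the server spends one hash per check, while a scraper hammering millions of URLs has to pay the solving cost every time.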
🌲 Half_Elf_Monk · Mar 20 at 18:22:
I was wondering to myself how to defend against this sort of thing. Could identified repeat-offender IP addresses be redirected elsewhere as a way to make it uneconomic to continue scraping this way? Like, if you recognize that a certain IP is a scraper, could an intelligent server redirect all those requests towards other target files?
So for a bunch of Chinese bots, I'd think to redirect the requests so that they end up DDoSing some Chinese intelligence architecture. That'll get their attention and initiate real-people mechanisms to calm those bots down. Or redirect all of Anthropic's requests back to their own public-facing website. I guess it would matter where those requests seem/appear to be coming from after the redirect.
Dynamically redirecting requests from known scrapers sounds like a fun way to f*^& around with scrapers. My hope is that it would end up costing more compute for the scraper than for the redirector. Even if that doesn't work, you could still dynamically redirect to a general pile of LLM-poison pills that dilutes/ruins the data.
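As a very rough sketch of that kind of dynamic redirect (the blocklisted IPs and the decoy URL below are placeholders, not anything from the article):

```python
# a sketch of the idea: send requests from known scraper IPs somewhere else
# (SCRAPER_IPS and DECOY_URL are made-up placeholders)
from http.server import BaseHTTPRequestHandler, HTTPServer

SCRAPER_IPS = {"203.0.113.7", "198.51.100.23"}   # IPs you've identified as scrapers
DECOY_URL = "https://example.com/endless-markov-garbage"  # poison pile, tarpit, etc.

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.client_address[0] in SCRAPER_IPS:
            # 307 makes the scraper do the follow-up request itself
            self.send_response(307)
            self.send_header("Location", DECOY_URL)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"normal content for normal visitors\n")

if __name__ == "__main__":
    HTTPServer(("", 8080), Handler).serve_forever()
```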
🐦 wasolili [...] · Mar 20 at 21:56:
my favorite part of the article is the guy complaining about the proof-of-work screen having an anime character on it because his girlfriend would be mad at him if she saw it on his computer.
"mostly residential IP addresses"
Probably paying a botnet operator for the privilege. Satire idea: an article purporting to be from a credit card fraudster complaining that LLM crawlers have driven the cost of residential proxies up.
I read a pretty simple suggested fix for this on the web1.1: just blanket-block all subnets assigned to the major cloud providers.
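For what it's worth, roughly what that blanket block looks like in code; the CIDR ranges here are placeholders, the real lists are published by the providers themselves:

```python
# blanket-blocking cloud provider subnets: a sketch with made-up CIDR ranges
import ipaddress

CLOUD_SUBNETS = [
    ipaddress.ip_network("203.0.113.0/24"),   # placeholder "provider A" range
    ipaddress.ip_network("198.51.100.0/24"),  # placeholder "provider B" range
]

def is_cloud_ip(addr: str) -> bool:
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in CLOUD_SUBNETS)

# e.g. in a request handler: if is_cloud_ip(client_ip): return 403
print(is_cloud_ip("203.0.113.42"))  # True
print(is_cloud_ip("192.0.2.1"))     # False
```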
@HanzBrix The article says that in many cases the flood is coming from residential IPs in unrelated subnets, and that each IP makes only one request.
I can think of a few ideas. However, if you are consistent then they might change their scraping to work around it.
Confuse scrapers by using links to URLs that depend on the user-agent and IP address. If they keep changing these (which the article suggests they do), then the links will not work; instead, the server can respond with an error message in plain text format, with instructions to manually find the correct URL. Perhaps limit this to requests for HTML files only. (There is a rough sketch of this after this comment.)
Requiring JavaScripts, CSS, pictures, sufficiently fast computers, etc, is not a good idea in my opinion, unless perhaps it has a
You might also check if the User-Agent specifies a browser that is known to implement pictures, CSS, JavaScripts, cookies, etc. If it claims to but doesn't, return an error message without any links, explaining what is wrong and suggesting that the user change the user-agent setting if they deliberately disabled these features, so that the request will be accepted.
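A rough sketch of the per-visitor-links idea mentioned above, assuming an HMAC token appended to each URL; the secret key, the query parameter name, and the error text are all made up for illustration:

```python
# per-visitor links: the token binds a URL to the requesting IP and User-Agent,
# so a crawler that rotates either gets a plain-text error instead of content
import hmac, hashlib

SECRET_KEY = b"change-me"  # placeholder server-side secret

def link_token(path: str, ip: str, user_agent: str) -> str:
    msg = f"{path}|{ip}|{user_agent}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()[:16]

def make_link(path: str, ip: str, user_agent: str) -> str:
    # emit this form of URL in the HTML pages you serve
    return f"{path}?t={link_token(path, ip, user_agent)}"

def check_link(path: str, token: str, ip: str, user_agent: str) -> bool:
    return hmac.compare_digest(token, link_token(path, ip, user_agent))

# on a mismatch, respond with something like:
# 403, text/plain: "This link was issued to a different client.
# Start again from the front page to get a working URL."
```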
🐦 wasolili [...] · Mar 21 at 01:37:
@HanzBrix if I read the article correctly, it's that the web scrapers used by some of the more egregious LLM companies are proxying through residential proxies (which I assume are offered by a botnet operator, given the nature of the residential proxy business)
though upon rereading I realize you were probably mentioning blocking cloud providers as a response to comments about blocking the crawlers of gemspace, not about the crawlers mentioned in the article, in which case, ignore me :)
Bumped into this today. Maybe useful? gemini://alexschroeder.ch/2025-03-21-defence-summary
That's ironic: sr.ht has a JS-less interface, so they can't simply integrate the Anubis PoW solution.
I set up port knocking on the HTTP server. (Maybe I should specify which port number to use in the gopher and/or scorpion server, which do not themselves require port knocking. Note that if you use the wrong port number, you will be locked out (in some cases), in order to prevent access by port scanning.)
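For anyone curious what that looks like, here is a rough application-level sketch; real setups usually do the knock check in the firewall instead, and the knock port 7000, HTTP port 8080, and 60-second window are arbitrary placeholders:

```python
# application-level "port knocking" sketch: connect to the knock port first,
# then the HTTP port serves you; everyone else gets a 403
import socketserver, threading, time
from http.server import BaseHTTPRequestHandler, HTTPServer

KNOCK_PORT, HTTP_PORT, KNOCK_TTL = 7000, 8080, 60
knocked = {}  # source IP -> time of the most recent knock

class KnockHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # any TCP connection to the knock port whitelists the source IP
        knocked[self.client_address[0]] = time.time()

class WebHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ip = self.client_address[0]
        if time.time() - knocked.get(ip, 0) > KNOCK_TTL:
            # no recent knock: refuse, so port scanners see nothing useful
            self.send_error(403, "Knock first")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"hello, knocked-in client\n")

if __name__ == "__main__":
    knock_server = socketserver.TCPServer(("", KNOCK_PORT), KnockHandler)
    threading.Thread(target=knock_server.serve_forever, daemon=True).start()
    HTTPServer(("", HTTP_PORT), WebHandler).serve_forever()
```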