FOSS infrastructure is under attack by AI companies
https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/
Is the era of anonymous browsing coming to an end?
Mar 20 · 4 months ago · 👍 LucasMW · 🙁 4
13 Comments
Didn't know about this Anubis thing. Seems useful. This whole scraping bot stuff is getting completely out of control
how horrible... I think we have been seeing a little bit of this on Gemini, with an increase in aggressive bots crawling capsules via proxies. I've had to block a fair few IP addresses. I just hope we can lie low enough that they don't start crawling geminispace directly. Not sure what we could do if they did -- even securing everything behind client certificate checks would only be a stop-gap measure, and I suppose we have no way of incorporating proof-of-work that wouldn't be a major pain for users.
I'm just waiting for the Terminator, as the messiah, who will restore the balance of things. The one problem I currently see is that he is also learning from human mistakes, so the balance may be flawed until Messiah 2.0 arrives.
At this point, we should just re-implement Coinhive. It would help balance out the costs of serving things on the internet. If scrapers are racking up 70 TB of usage on a webserver, they'd better pay for it with proof of work....
@LucasMW Today, proof of work is not an option when you already have enough money and slaves. I see invitation-only access as the only solution for the humans you can still be assured are not humanoid yet; and make sure it won't be called racism, bot-phobia, or any other legal term very soon.
@ps The idea of proof of work is mostly to balance out the costs of such an attack. So even if hit with several TB of bandwidth, the maintainer can still pay for the server. Optimally, the balance of proof of work should tilt towards the maintainer, so that even if attacked, someone would still be able to make a buck.
This would still be a pain for legit users, who may be forced to authenticate and pay for it in order to avoid such blocks.
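To make the idea concrete, here is a rough hashcash-style sketch of what a proof-of-work challenge can look like. The 20-bit difficulty, the challenge format, and the function names are all made up for illustration; this is not how Anubis or Coinhive actually work, just the general shape of the technique.

```python
# hashcash-style proof of work: a rough sketch, not any real product's scheme
# (the 20-bit difficulty and the challenge format here are arbitrary choices)
import hashlib, os, itertools

DIFFICULTY_BITS = 20  # each extra bit roughly doubles the client's work

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def make_challenge() -> str:
    # server side: hand this to the client along with the difficulty
    return os.urandom(16).hex()

def solve(challenge: str) -> int:
    # client side: burn CPU until the hash has enough leading zero bits
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce

def verify(challenge: str, nonce: int) -> bool:
    # server side: one hash to check, about a million for the client to find
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

if __name__ == "__main__":
    c = make_challenge()
    n = solve(c)          # ~2^20 hashes on average at 20 bits
    print(verify(c, n))   # True
```

The asymmetry is the whole point: the server spends one hash per check, while a scraper hammering millions of URLs has to pay the solving cost every time.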
🌲 Half_Elf_Monk · Mar 20 at 18:22:
I was wondering to myself how to defend against this sort of thing. Could identified repeat-offender IP addresses be redirected elsewhere as a way to make it uneconomic to continue scraping this way? Like, if you recognize that a certain IP is a scraper, could an intelligent server redirect all those requests towards other target files?
So for a bunch of Chinese bots, I'd think to redirect the requests so that they end up DDoSing some Chinese intelligence architecture. That'll get their attention and initiate real-people mechanisms to calm those bots down. Or redirect all of Anthropic's requests back to their own public-facing website. I guess it would matter where those requests seem/appear to be coming from after the redirect.
Dynamically redirecting requests from known scrapers sounds like a fun way to f*^& around with scrapers. My hope is that it would end up costing more compute for the scraper than for the redirector. Even if that doesn't work, you could still dynamically redirect to a general pile of LLM-poison pills that dilutes/ruins the data.
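As a very rough sketch of that kind of dynamic redirect (the blocklisted IPs and the decoy URL below are placeholders, not anything from the article):

```python
# a sketch of the idea: send requests from known scraper IPs somewhere else
# (SCRAPER_IPS and DECOY_URL are made-up placeholders)
from http.server import BaseHTTPRequestHandler, HTTPServer

SCRAPER_IPS = {"203.0.113.7", "198.51.100.23"}   # IPs you've identified as scrapers
DECOY_URL = "https://example.com/endless-markov-garbage"  # poison pile, tarpit, etc.

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.client_address[0] in SCRAPER_IPS:
            # 307 makes the scraper do the follow-up request itself
            self.send_response(307)
            self.send_header("Location", DECOY_URL)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"normal content for normal visitors\n")

if __name__ == "__main__":
    HTTPServer(("", 8080), Handler).serve_forever()
```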
🐦 wasolili [...] · Mar 20 at 21:56:
my favorite part of the article is the guy complaining about the proof-of-work screen having an anime character on it because his girlfriend would be mad at him if she saw it on his computer.
"mostly residential IP addresses"
Probably paying a botnet operator for the privilege. Satire idea: an article purporting to be from a credit card fraudster complaining that LLM crawlers have driven the cost of residential proxies up.
I read a pretty simple suggested fix for this on the web1.1: just blanket-block all subnets assigned to the major cloud providers.
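For what it's worth, roughly what that blanket block looks like in code; the CIDR ranges here are placeholders, the real lists are published by the providers themselves:

```python
# blanket-blocking cloud provider subnets: a sketch with made-up CIDR ranges
import ipaddress

CLOUD_SUBNETS = [
    ipaddress.ip_network("203.0.113.0/24"),   # placeholder "provider A" range
    ipaddress.ip_network("198.51.100.0/24"),  # placeholder "provider B" range
]

def is_cloud_ip(addr: str) -> bool:
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in CLOUD_SUBNETS)

# e.g. in a request handler: if is_cloud_ip(client_ip): return 403
print(is_cloud_ip("203.0.113.42"))  # True
print(is_cloud_ip("192.0.2.1"))     # False
```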
@HanzBrix The article says that in many cases the flood is coming from residential IPs in unrelated subnets, and that each IP makes only one request.
I can think of a few ideas. However, if you are consistent then they might change their scraping to work around it.
Confuse scrapers by using links to URLs that depend on the user-agent and IP address. If they keep changing these (which the article suggests they do), then the links will not work; instead, the server can respond with an error message in plain text format, with instructions to manually find the correct URL. Perhaps limit this to requests for HTML files only. (There is a rough sketch of this after this comment.)
Requiring JavaScripts, CSS, pictures, sufficiently fast computers, etc, is not a good idea in my opinion, unless perhaps it has a
You might also check if the User-Agent specifies a browser that is known to implement pictures, CSS, JavaScripts, cookies, etc. If it claims to but doesn't, return an error message without any links, explaining what is wrong and suggesting that the user change the user-agent setting if they deliberately disabled these features, so that the request will be accepted.
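A rough sketch of the per-visitor-links idea mentioned above, assuming an HMAC token appended to each URL; the secret key, the query parameter name, and the error text are all made up for illustration:

```python
# per-visitor links: the token binds a URL to the requesting IP and User-Agent,
# so a crawler that rotates either gets a plain-text error instead of content
import hmac, hashlib

SECRET_KEY = b"change-me"  # placeholder server-side secret

def link_token(path: str, ip: str, user_agent: str) -> str:
    msg = f"{path}|{ip}|{user_agent}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()[:16]

def make_link(path: str, ip: str, user_agent: str) -> str:
    # emit this form of URL in the HTML pages you serve
    return f"{path}?t={link_token(path, ip, user_agent)}"

def check_link(path: str, token: str, ip: str, user_agent: str) -> bool:
    return hmac.compare_digest(token, link_token(path, ip, user_agent))

# on a mismatch, respond with something like:
# 403, text/plain: "This link was issued to a different client.
# Start again from the front page to get a working URL."
```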
🐦 wasolili [...] · Mar 21 at 01:37:
@HanzBrix if I read the article correctly, it's that the web scrapers used by some of the more egregious LLM companies are proxying through residential proxies (which I assume are offered by a botnet operator, given the nature of the residential proxy business)
though upon rereading I realize you were probably mentioning blocking cloud providers as a response to comments about blocking the crawlers of gemspace, not about the crawlers mentioned in the article, in which case, ignore me :)
Bumped into this today. Maybe useful? gemini://alexschroeder.ch/2025-03-21-defence-summary
That's ironic: sr.ht has a JS-less interface, so they can't simply integrate the Anubis PoW solution.
I set up port knocking on the HTTP server. (Maybe I should specify which port number to use in the gopher and/or scorpion server, which do not themselves require port knocking. Note that if you use the wrong port number, you will be locked out (in some cases), in order to prevent access by port scanning.)
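For anyone curious what that looks like, here is a rough application-level sketch; real setups usually do the knock check in the firewall instead, and the knock port 7000, HTTP port 8080, and 60-second window are arbitrary placeholders:

```python
# application-level "port knocking" sketch: connect to the knock port first,
# then the HTTP port serves you; everyone else gets a 403
import socketserver, threading, time
from http.server import BaseHTTPRequestHandler, HTTPServer

KNOCK_PORT, HTTP_PORT, KNOCK_TTL = 7000, 8080, 60
knocked = {}  # source IP -> time of the most recent knock

class KnockHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # any TCP connection to the knock port whitelists the source IP
        knocked[self.client_address[0]] = time.time()

class WebHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ip = self.client_address[0]
        if time.time() - knocked.get(ip, 0) > KNOCK_TTL:
            # no recent knock: refuse, so port scanners see nothing useful
            self.send_error(403, "Knock first")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"hello, knocked-in client\n")

if __name__ == "__main__":
    knock_server = socketserver.TCPServer(("", KNOCK_PORT), KnockHandler)
    threading.Thread(target=knock_server.serve_forever, daemon=True).start()
    HTTPServer(("", HTTP_PORT), WebHandler).serve_forever()
```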