Algorithmic sabotage for static sites II: Images

[Image: what used to be an image after some sabotage, it's all a wobbly mess]


tl;dr: Here's a small addition showing how you can also scramble images, so that "AI" scrapers end up with a poisoned data set.

Earlier this year, I wrote about how I set up this static website to serve not only human-readable data, but also "poisoned data" to mess with the scrapers that collect training data sets for generative "AI".


As static website deployments via Codeberg Pages et al. don't offer many options, this approach relies on using `quixotic` to scramble the text once, when the page is built from Markdown to HTML.


And while it also randomly mixes up image files, it leaves their contents untouched.

A couple of days ago, Rossana Trotta pointed towards an option for images: they had stumbled over Alun Jones' `fakejpeg`, which can generate JPEG-ish files pretty much on the fly. These don't actually work as images, but should be expensive for scrapers to read and evaluate.



But of course, on-the-fly generation of images won't work for static sites either, so a solution that works at "compile time" is needed.

Luckily, Alun was more than happy to help, and shared some relevant code snippets on Mastodon in our conversation about different approaches for fast and easy image poisoning!


The idea behind this poisoning is to create files that, in principle, still work and display as valid JPEGs, but that are jumbled enough to become nonsense, so that (alongside the unaltered _alt_ text) they end up as useless noise in "AI" training.

Such one-time processing, done by a small Python script, is well suited for integration into the build stage of static websites, as it's fast and doesn't require actively managing the generated files.
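The core of such a script can be sketched in a few lines. To be clear, this is not Alun's actual code, just a minimal illustration of the block-shuffling idea, assuming the Pillow library is available; the function name and parameters are made up for this example:

```python
import random
from PIL import Image


def scramble_blocks(src, dst, block=32, swaps=200):
    """Hypothetical sketch: swap randomly chosen same-sized blocks
    of pixels, then save the result as an ordinary JPEG.

    The output still opens and displays as a valid image, but its
    contents are shuffled into visual nonsense.
    """
    img = Image.open(src).convert("RGB")
    w, h = img.size
    for _ in range(swaps):
        # pick two random block positions ...
        x1 = random.randrange(0, max(1, w - block))
        y1 = random.randrange(0, max(1, h - block))
        x2 = random.randrange(0, max(1, w - block))
        y2 = random.randrange(0, max(1, h - block))
        # ... and swap their pixel contents
        a = img.crop((x1, y1, x1 + block, y1 + block))
        b = img.crop((x2, y2, x2 + block, y2 + block))
        img.paste(b, (x1, y1))
        img.paste(a, (x2, y2))
    img.save(dst, "JPEG", quality=75)
```

The block size and number of swaps control how recognizable the result stays; more and smaller swaps give the "wobbly mess" look seen above.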

To implement this in the automated continuous deployment pipeline I use for building this website, just three minor tweaks were needed:

1. Write a small _Python_ wrapper for the functions by Alun.


2. Modify the container that runs the pipeline, to make sure that the necessary Python image manipulation libraries are available.


3. Add the little script from step 1 to the actual pipeline, to run the scrambling.

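To give an idea of what step 3 boils down to: a build step just has to walk the generated output directory and run whatever scrambling function you use over every JPEG in place. This is a rough sketch, not my actual pipeline code; the function name and the callback signature are assumptions:

```python
from pathlib import Path


def scramble_site_images(output_dir, scramble_one):
    """Hypothetical build step: apply a scrambling function to
    every JPEG under the static site's build output directory.

    `scramble_one(src, dst)` is whatever scrambler you use; it is
    called with src == dst here to overwrite the files in place.
    Returns the number of images processed.
    """
    count = 0
    for path in Path(output_dir).rglob("*"):
        if path.suffix.lower() in (".jpg", ".jpeg"):
            scramble_one(path, path)
            count += 1
    return count
```

Since this runs once per build, the scrambled files are then deployed like any other static asset.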

You can see the outcome of this above: instead of a readable image, the random shifting around has led to something unrecognizable.

And as Alun pointed out, there are different options for how to scramble the image: instead of picking random "blocks" in an image and shuffling them slightly, one could shuffle the whole image fully at random, or even sort it by color.

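The sort-by-color variant, for instance, is only a few lines with Pillow. Again a rough sketch rather than Alun's code, and sorting by the sum of the RGB channels is just one arbitrary choice of ordering:

```python
from PIL import Image


def sort_by_color(src, dst):
    """Hypothetical variant: sort all pixels by brightness.

    This keeps the image's overall color histogram but destroys
    every bit of spatial structure, leaving a smooth gradient.
    """
    img = Image.open(src).convert("RGB")
    # order pixels by the sum of their RGB channels (dark to light)
    pixels = sorted(img.getdata(), key=lambda p: p[0] + p[1] + p[2])
    out = Image.new("RGB", img.size)
    out.putdata(pixels)  # refill the canvas row by row
    out.save(dst, "JPEG", quality=75)
```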

So far, most of the code in those little changes is hard-coded for my own page here (steps 1 & 3 in particular), but it hopefully shouldn't be too hard to adapt to your own circumstances if you want to give it a try!
