Re:
Reply to: https://blog.libove.org/posts/ai-posioning/
Artists on Artstation started to post "No AI" images
The age of AI generated “content” (I hate that word!) has come. Stable Diffusion, DALL-E and Midjourney are used to create artworks that win contests. Stable Diffusion in particular, with its openly shared weights, has led to rapid adoption of AI image tools into the workflows and tools artists use every day. ChatGPT can create high-quality text that can no longer be distinguished from text written by humans. Platforms like Stack Overflow have felt the need to temporarily ban its use until they figure out how to deal with this.
AIs (still) rely on vast amounts of training data. To build improved models, larger and larger datasets are required. Models will have to be retrained periodically to include new concepts. Image generators have to be adjusted to the ever-shifting taste of the humans using them. Text generators have to learn the latest memes to stay relevant.
The explosive popularity of AI generated content will lead to new challenges when building new datasets. I see two ways in which progress on these generative models could be slowed down, or even stopped entirely, by their own popularity.
The datasets used for this are generally created by scraping the required data from the internet. This means everything posted to Reddit, Twitter and the rest of the internet is collected, cleaned and prepared for training AIs. The cleaning part will become harder and harder with each new iteration of generative models, as you want to avoid AI-generated content in your training set.
An AI system trained on data created by the system itself will not become better, as the information in the data is already known to the system. If the data used are valid examples, this might just waste computation time during training. But any poor examples will degrade the performance of the resulting model.
If a considerable portion of the training data comprises examples created by a previous version of a generative model, the next model will learn to recreate the style and errors of the previous model.
A lot of artists hate generative models, especially for their ability to mimic the style of individual artists. This has led some to hide their artworks, because they don’t want to provide even more data to these systems. If the concerns of artists are ignored, some might try to sabotage the systems which steal their art style.
A common technique for using Stable Diffusion is to give a prompt like “A cat sitting on a bench, by Artist X”. For artists with a large portfolio, this creates results which, at first glance, could have been created by those artists. As datasets are generated automatically, it might be possible to introduce adversarial examples into the training data which break such prompts.
Artists might publish decoy “artworks” on their feeds. These decoys would be easily recognisable by humans, but scraping systems would include them in the training sets. If an artist has more decoys than real artworks associated with their name, the AI system will mimic the style of the decoys instead.
Artworks can be published with surrounding noise. Instead of just publishing an image, the image might be extended by random frames. The descriptions could be padded with additional nonsense text.
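A minimal sketch of the framing idea, using Pillow (the border size is an arbitrary choice):

import random
from PIL import Image

def add_noise_frame(image, border=64):
    # Embed the artwork in a larger canvas filled with random pixel noise.
    w, h = image.size
    canvas = Image.new("RGB", (w + 2 * border, h + 2 * border))
    noise = [
        (random.randrange(256), random.randrange(256), random.randrange(256))
        for _ in range(canvas.width * canvas.height)
    ]
    canvas.putdata(noise)
    canvas.paste(image, (border, border))  # the real artwork sits inside the noise
    return canvas

framed = add_noise_frame(Image.open("artwork.png").convert("RGB"))
framed.save("artwork_framed.png")

A scraper that naively crops or resizes everything to its training resolution would dilute the artwork with noise, while a human immediately sees where the real image begins.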
All these countermeasures can be circumvented, but this will be expensive. For general models, trained on massive datasets, such cleaning measures will most likely be too expensive. However, fine-tuning a model only requires a relatively small training set. Cleaning such a dataset for a single artist will be a simple task. The poisoning will only be a minor inconvenience for people creating such specialized models.
I'm using these aliases to work with Django. They save me a few seconds per command and lower the burden of trying something in the shell by a tiny amount :).
alias vv=". venv/bin/activate"                # activate the local virtualenv
alias djs="python manage.py shell"            # open a Django shell
alias djr="python manage.py runserver"        # start the development server
alias djmm="python manage.py makemigrations"  # create migrations from model changes
alias djmi="python manage.py migrate"         # apply migrations
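With these in the shell profile, trying out a model change becomes a quick round trip:

vv
djmm && djmi
djr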
Extended the editor to allow replies. Still have to add a scraper to fill in the title of the referenced page automatically.
I'm taking part in Advent of Code 2022 and am posting my solutions on GitHub.
This is my first article written with an editor on my website!
Up until now I wrote all my articles either directly on the server or on my PC. For this I used a convoluted system, working partially on the server and partially on my PC, with git and ssh in between. As this process is tedious, it kept me from writing short updates.
The online editor is still very limited, but I will extend it as needed. Additionally, I've adjusted some of the internal structure of the blog to allow for different types of posts. This will allow me to also write short notes, which will not show up in the main feed.
I've published my graph embedding library graph_force on GitHub and PyPI. I wrote about the process of building it a few days ago.
My own algorithm turned out to be quite slow when compared to networkx. For this reason I also reimplemented the networkx algorithm, but with multithreading support. The most important feature for me was the ability to load the graph from a binary file. While networkx used too much memory while ingesting the graph data, I can effortlessly write it to a file and load it in graph_force.
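Usage looks roughly like this (a sketch from memory; treat the exact function names as assumptions and check the README):

import graph_force

# a small ring graph: 4 nodes, edges as (from, to) index pairs
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
pos = graph_force.layout_from_edge_list(4, edges)

for node, (x, y) in enumerate(pos):
    print(node, x, y)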
At the moment this library fulfils my needs, but by publishing it I commit to maintaining it. Maybe it is useful to someone else.
Creative usage of WebAssembly as a "universal" binary, running on every machine. I'm currently using both Rust and Go. Will keep this in mind if I ever want to combine them :)
I just realised I never wrote about this small project I created back in 2020. The show Community had a bunch of Twitter accounts for the characters on the show. They wrote tweets in character every now and then, which added some additional character interactions to the show. A few times the Twitter interactions were also referenced in the show.
I've scraped the tweets of all (known) accounts and created a website to browse all Community tweets. I tried to assign the tweets to episodes based on the airing dates. With the acquisition of Twitter by Musk and the ensuing "turbulence" it might be a good idea to check the project again to see if more data can be preserved. It would be a shame if this piece of my favorite show disappeared forever.
The source and prepared datasets can be found on GitHub.
The show also had a homepage for the fictional community college. Sadly the website is no longer available, but it's in the Internet Archive. Unfortunately the videos of the A/V Department did not survive.
After scraping "all" Mastodon instances, I wanted to visualize the graph of instances. My expectation is that this is a (quite dense) social graph. To bring order to such a graph, a force-directed layout can be used. I previously used the networkx implementation for this. However, the peers graph I currently have is too large, with 24007 nodes and over 80 million edges. When trying to import this into networkx I simply ran out of memory (16 GB). After asking for advice on Mastodon, I tried out Gephi, which also ran out of memory before loading the entire graph.
Because I still want to visualize this graph, I've decided to write my own graph layout program. For the first implementation I followed the slides of a lecture at KIT. This gave me some janky but promising results, as I was able to load my graph and an iteration only required ~10 seconds. To validate my implementation I created a debug graph, consisting of 2000 nodes with 4 clusters of different sizes.
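A debug graph like this is easy to generate; a sketch (the cluster sizes and edge probabilities here are made-up values, not the ones I used):

import random

def clustered_graph(sizes=(200, 400, 600, 800), p_in=0.05, p_out=0.0005):
    # Random graph: dense edges inside clusters, sparse edges between them.
    offsets = [sum(sizes[:i]) for i in range(len(sizes))]
    n = sum(sizes)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            same_cluster = any(o <= i < o + s and o <= j < o + s
                               for o, s in zip(offsets, sizes))
            if random.random() < (p_in if same_cluster else p_out):
                edges.append((i, j))
    return n, edges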
After this first implementation I took pen and paper and thought about the problem a bit. This led to an improved version with a simpler model, leading to faster execution times and quicker convergence.
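The core of any such spring model is a single update step. A generic sketch with numpy, using Fruchterman-Reingold-style forces (not my actual implementation, just the basic idea):

import numpy as np

def iterate_layout(pos, edges, k=1.0, step=0.1):
    # One spring-embedder iteration: every pair repels, connected nodes attract.
    disp = np.zeros_like(pos)

    # repulsion between all pairs of nodes, O(n^2)
    delta = pos[:, None, :] - pos[None, :, :]   # pairwise difference vectors
    dist = np.linalg.norm(delta, axis=-1)
    np.fill_diagonal(dist, 1.0)                 # avoid division by zero on the diagonal
    disp += np.sum(delta * (k * k) / (dist ** 2)[..., None], axis=1)

    # attraction along edges
    for u, v in edges:
        d = pos[u] - pos[v]
        force = d * np.linalg.norm(d) / k
        disp[u] -= force
        disp[v] += force

    # cap the displacement per node to keep the layout from exploding
    lengths = np.linalg.norm(disp, axis=-1, keepdims=True) + 1e-9
    return pos + disp / lengths * np.minimum(lengths, step)

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
pos = np.random.default_rng(0).random((5, 2))
for _ in range(100):
    pos = iterate_layout(pos, edges)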
Embedding the Mastodon instance graph is still challenging. The algorithm creates oscillations in the layout, which I suspect are introduced by one (or multiple) large cliques. I will post an update soon.
Update:
Bonus image of the Florentine families graph: