Libove Blog

Personal Blog about anything - mostly programming, cooking and random thoughts

Missed Shot at Artificial General Intelligence

The stated goal of OpenAI is "to ensure that artificial general intelligence benefits all of humanity". With the release of ChatGPT, they might have missed humanity’s only shot at creating an Artificial General Intelligence (AGI).

It's all about data

The performance ("intelligence") of large language models (LLMs) mainly depends on the scale of its training data and the size of the model [1]. To create a better LLM you need to increase its size, and train it on more data. The architecture and configuration of an LLM can almost be neglected in comparison.

However, the intelligence of a model does not grow linearly with its size [1]. There are diminishing returns on scaling LLMs: if you increase the model and training set size by 10x, you only gain about as much performance as the previous 10x increase bought you. This explains why vast amounts of data are required to effectively train large language models.
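
To make the diminishing returns concrete, here is a minimal sketch of a Chinchilla-style scaling law, using the constants fitted by Hoffmann et al. (2022). The constants are illustrative and not necessarily the numbers behind [1]:

package main

import (
    "fmt"
    "math"
)

// Chinchilla-style scaling law: loss falls as a power law in
// parameters N and training tokens D. The constants are the fit
// published by Hoffmann et al. (2022); treat them as illustrative.
func loss(n, d float64) float64 {
    const e = 1.69               // irreducible loss
    const a, alpha = 406.4, 0.34 // model-size term
    const b, beta = 410.7, 0.28  // data-size term
    return e + a/math.Pow(n, alpha) + b/math.Pow(d, beta)
}

func main() {
    n, d := 1e9, 2e10 // 1B parameters, 20B tokens
    for i := 0; i < 4; i++ {
        fmt.Printf("N=%.0e D=%.0e loss=%.3f\n", n, d, loss(n, d))
        n, d = n*10, d*10 // each 10x step buys a smaller absolute gain
    }
}

Running this, the loss drops by about 0.45, then 0.21, then 0.12 per 10x step: each step costs ten times as much and delivers less.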

ChatGPT 3 was trained on 500 billion tokens [2]. There are no official numbers (that I could find) on how much data was used to train ChatGPT 4, but rumors in the AI community state that it was trained on 13 trillion tokens. With these numbers, the performance step from 3 to 4 required a 26x increase in data.

Estimates put the size of publicly available, deduplicated data at 320 trillion tokens [4]. This is 24.6x more data than ChatGPT 4 was (likely) trained on. If these numbers are correct, the performance of LLMs will only increase by about as much as it did from ChatGPT 3 to ChatGPT 4 before we run out of data.
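
Spelled out, the two scaling steps are almost exactly the same size:

    13 T / 0.5 T  = 26x    (ChatGPT 3 -> ChatGPT 4)
    320 T / 13 T  ≈ 24.6x  (ChatGPT 4 -> all public data)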

I doubt that this will be enough to reach AGI level intelligence.

Poisoned Data

Now you might say: "we produce more data every day, models can get better in the future". We could just wait some decades, train new models from time to time, and see a gradual increase in performance. And one day we would suddenly have an AGI. But the release of ChatGPT created a problem: it poisoned any data collected after November 30, 2022.

The release of ChatGPT was the wet dream of every spammer, bot operator, troll, and wannabe influencer. Suddenly you could create seemingly high-quality content, indistinguishable from human-written text, at virtually zero cost. Ever since, the internet has been filling up with LLM-generated content. And this is a problem for all future training runs.

Models that are trained on their own generations (or on data created by other models) start to forget; their performance declines [3]. Therefore you have to avoid training on AI-generated content, otherwise the increase in data may decrease your performance. As it is virtually impossible to clean a dataset of AI-generated text, any data collected after Nov. 2022 should be avoided. Maybe you can still use a few more months or years of data, but at some point more data will hurt the model more than it helps.

The publicly available data that we have now is all we will get to train an AGI. If we need more data, it will have to be collected at an exorbitant price to ensure it's not poisoned by AI-generated data.

We reached a dead end

The size of current training sets and of the potentially available data shows that we will not reach AGI levels with the current state-of-the-art approaches. Due to data poisoning we will not get substantially more usable data. Therefore AGI is (currently) not achievable. Opening Pandora's box of accessible generative AI may have killed our chance of creating an artificial general intelligence.

If we want to build an AGI we will have to do it with the data we have now.

#AI #AGI #LLM #GenAI


Szechuan Pepper Udon Noodles

Servings: 1 portion

Ingredients

  • 100 g udon noodles
  • handful of soy chunks (textured soy protein)
  • 1 small zucchini
  • 1 small carrot
  • 1 tbsp Szechuan pepper
  • 1 clove of garlic
  • 1 cm ginger
  • 1 tbsp black bean paste
  • 1 tbsp soy sauce
  • 1 tbsp agave syrup
  • 200 ml water

Instructions

This recipe was a bit improvised, but it turned out really well. The quantities were estimated after the fact, so taste and adjust as needed.

  • Cook the soy chunks
  • Finely chop the garlic and ginger
  • Mix all the seasonings with the water
  • Cut the carrot and zucchini into thin slices
  • Cook the noodles according to the package instructions
  • Fry the soy chunks
  • Add the zucchini and carrot and fry for 2 minutes
  • Deglaze with the seasoning mixture
  • Reduce the sauce and fold in the noodles

#vegan #nudeln #udon #zucchini #karotten #szechuan #szechuanpfeffer


Random Tables for DnD 5e

List of random tables I've created to quickly stock shops or generate treasure when DMing.

Spells

Magic Items

Armor

Rings

Rods

Staves

Weapons

Wondrous Items

Generators

#DnD #5e #random #generator #tabletop


Pasta with Peas

Servings: 3 portions

Ingredients

  • 250 g pasta
  • 200 g frozen peas
  • 200 ml vegan cream
  • 1 onion
  • 1 clove of garlic
  • 50 g nutritional yeast flakes
  • 1 tbsp olive oil
  • 1 tbsp tomato paste
  • 1 tsp paprika powder
  • 1 tsp oregano
  • salt and pepper

Instructions

A simple and quick pasta dish, ideal for toddlers. Tastes great with vegan parmesan.

  • Dice the onion and finely chop the garlic
  • Cook the pasta
  • Sauté the onion and garlic in the olive oil until translucent
  • Add the tomato paste and fry for 2-3 minutes
  • Add the peas, yeast flakes, and spices and fry briefly
  • Add the cream and 100 ml water and stir well
  • Bring the sauce to a boil, stirring occasionally
  • Drain the pasta and mix everything together well
  • Season to taste with salt and pepper

#vegan #rezept #nudeln #erbsen


Vegan Pesto Pasta Salad with Arugula

Servings: 4 portions

Ingredients

  • 500 g fusilli pasta
  • 250 g cherry tomatoes
  • 125 g arugula
  • 1 jar of vegan green pesto
  • 2 tbsp agave syrup
  • 2 tbsp medium-hot mustard
  • 50 g vegan parmesan (optional)

Instructions

The vegan version of my favorite pasta salad.

  • Cook the pasta and let it cool, or rinse it with cold water
  • Wash the arugula and let it drain
  • Halve or quarter the tomatoes
  • Mix the pasta, tomatoes, pesto, agave syrup, and mustard
  • Season to taste with vegan parmesan (optional)
  • Mix in the arugula

#vegan #rezept #nudelsalat #einfach


Microformat Categories for Hashtags in owl-blogs

I recently added hashtag support to owl-blogs. The initial reason for this was to make posts more discoverable via ActivityPub, but I found it helpful for further categorizing posts. The implementation was quite simple: I use the markdown renderer goldmark, which has a plugin for hashtags.

As tags are also part of microformats2, I wanted to mark the hashtags accordingly. This is currently not possible with the hashtag extension.

I've extended this to allow adding arbitrary attributes to the link tag (Related Pull Request). Until this is merged into the main repository I'll use my own version, which can be done by adding a replace directive to the go.mod:

replace go.abhg.dev/goldmark/hashtag => github.com/H4kor/goldmark-hashtag v0.0.0-20240619193802-bec327f5be38
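
For context, wiring the extension into goldmark looks roughly like this. This is a sketch from memory of the upstream hashtag API; the /tags/ URL scheme is just an example, and the extra microformats2 attribute (e.g. class="p-category") is what my fork adds on top:

package main

import (
    "bytes"
    "fmt"

    "github.com/yuin/goldmark"
    "go.abhg.dev/goldmark/hashtag"
)

// tagResolver maps a #tag to a URL on the blog.
type tagResolver struct{}

func (tagResolver) ResolveHashtag(n *hashtag.Node) ([]byte, error) {
    return append([]byte("/tags/"), n.Tag...), nil
}

func main() {
    md := goldmark.New(
        goldmark.WithExtensions(&hashtag.Extender{
            // Resolver turns #tags into links; with my fork the
            // rendered <a> can additionally carry attributes such
            // as class="p-category" for microformats2.
            Resolver: tagResolver{},
        }),
    )
    var buf bytes.Buffer
    if err := md.Convert([]byte("hello #indieweb"), &buf); err != nil {
        panic(err)
    }
    fmt.Println(buf.String()) // rendered HTML with the hashtag as a link
}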

#go #dev #owlblogs #markdown #indieweb




Link: Situational Awareness - The Decade Ahead

Link: https://situational-awareness.ai/

Thesis arguing that #AGI will be reached in the next decade. I have not yet read the full text.

Problems I see with the argumentation so far:

The Data Wall

The data wall is a (hard) open problem. Larger models will either need massively more data or have to be orders of magnitude more efficient with the given data.

More data is not available at that scale. The latest models are already trained on virtually the entire internet. They argue that a strategy similar to AlphaGo, where the model creates its own training data by playing against itself, could be deployed for #GenAI. I find this implausible, as generating text that is intelligent enough to be worth training on already requires a more capable AI.

Similarly, being more efficient with data is still an open problem, but I don't know enough about this area to evaluate how likely a breakthrough is.

Additionally, going forward, any new datasets will be poisoned by AI output, as these models are already used at massive scale to create "content". Research suggests that training on such data degrades the performance of models.

Large Scale Computing

Even if the data wall is broken, scaling the models will need massive computing power, consuming ever more resources and energy. This will only be tolerated by the general public as long as it is overall beneficial to them. There are already industries (mainly creative ones) being massively disrupted by the current generation of GenAI. Philosophy Tube, in Here's What Ethical AI Really Means, shows how strikes and collective action can be tools to prevent the development of more powerful AI systems.

#AI