Libove Blog

Personal Blog about anything - mostly programming, cooking and random thoughts


Quick Asian Stir-Fry

Servings: 2-3 portions

Ingredients

  • 200g udon noodles
  • 4 tbsp soy sauce
  • 1 tbsp rice vinegar
  • 1 tbsp agave syrup
  • 2 tsp cornstarch
  • vegetables and soy chunks (textured soy protein)
  • 1 piece of ginger (optional)
  • 1 clove of garlic (optional)

Instructions

I cook this recipe very often at the moment during my lunch break. It is quick and easy and can be varied endlessly.

  • Cut the vegetables into bite-sized strips. Boil the soy chunks and drain them.
  • Finely chop the ginger and garlic.
  • Mix the soy sauce, rice vinegar, and agave syrup with 200ml of water.
  • Cook the udon noodles according to the package instructions.
  • Fry the soy chunks and vegetables, adding the ingredients in stages depending on their cooking time.
  • Add the ginger and garlic, fry for 1 minute, then deglaze with the sauce.
  • Dissolve the cornstarch in a little water and stir it in.
  • Drain the finished noodles and add them to the sauce.

Popular Vegetable Combinations

  • Soy chunks and frozen peas
  • Soy chunks and zucchini
  • Carrots and frozen peas
  • Bell peppers and onions

#rezept #schnell #vegan



AI Chat Control

It just occurred to me that the integration of "AI tools" at the operating system level creates exactly the infrastructure that was demanded for chat control.

Chat Control

Since almost all popular messengers now communicate with end-to-end encryption, security agencies have a hard time monitoring chats. As the operators of the chat apps have no insight either, the only remaining option for surveillance is to intercept messages on the users' devices. At the moment, this requires deploying a "state trojan".

Because deploying a trojan is a severe and costly intervention, politicians keep trying to create an alternative. The most recent attack was the so-called chat control: under the pretext of fighting child abuse, chat providers were to be forced to scan chats for documented child abuse material.

Technically, these attempts mean that end-to-end encryption has to be either removed or circumvented. Providers could modify the encryption to analyze data on their servers, or alternatively copy it out through a second channel for analysis.

AI Tools

In the current GenAI hype, an "AI feature" is being built into every app that can't get away fast enough. The giants Google and Microsoft go one step further and build AI directly into their operating systems.

At Microsoft, this "feature" is called Recall and is part of Windows 11 by default. Recall takes a screenshot every 30 seconds and stores it in a database. The AI can search this database to help users find things again. According to Microsoft, the processing happens exclusively on the local machine.

Google is currently integrating its AI system Gemini deeply into the Android operating system. It is meant to let users invoke the AI across all apps, without each app having to implement AI features itself. However, this also means that Gemini can access the data in every app. Since the data is processed "in the cloud", Google thereby gains access to chats and data that were previously private.

(I haven't looked into Apple's plans, but they are playing the AI game as well.)

Developing AI models requires massive amounts of data. Current models have already sucked every freely accessible source dry and need new supply. Even if these systems are not being used to generate training data right now, it is not far-fetched that this could change quickly. The race for the best models demands ever more data. And data that captures user behavior did not only become valuable with the current AI hype.

OpenAI, too, wants to collect data through a browser of its own. Meta never had a pronounced understanding of privacy and is busily collecting data through Facebook and Instagram.

Only One Step Missing

The infrastructure needed to collect AI training data and the infrastructure needed for surveillance à la chat control differ only in a few details. In both cases you need functionality that collects data on the users' devices. This data then has to be exfiltrated so it can be processed on your own servers. In the end, the only difference between AI and chat control is what you intend to do with the data.

Fortunately, chat control has failed for now, but the tech corporations are currently building infrastructure that could easily be adapted to implement it. Recall could scan its screenshots for potentially illegal content on the side and report them. With Gemini, the data is already processed directly by Google, and next to the AI a second system could just as well run its own analyses.

The data silos created this way will also become a coveted target for security agencies. Since the infrastructure will practically already exist the next time someone tries to introduce surveillance, fending off these demands will become considerably harder.

#politik #ki #überwachung #datenschutz


Re: Wikipedia to follow new album releases

Reply to: https://jamesg.blog/2025/07/15/brainstorming-a-tool-to-follow-new-album-releases-with-wikipedia

You don't need to parse the data from the Wikipedia page; it is also available in Wikidata. However, the data is not (always?) the same as on Wikipedia. For the example below, the latest album was not linked yet (fixed now).

Example for Rise Against

# P175 = performer, P31 = instance of, Q482994 = album, P577 = publication date
SELECT ?album ?performerLabel ?albumLabel ?publication_date WHERE {
  # Q246352 = Rise Against
  VALUES ?performer { wd:Q246352 }
  ?album wdt:P175 ?performer .
  ?album wdt:P31 wd:Q482994 .
  ?album wdt:P577 ?publication_date .
  # fetch English labels for ?album and ?performer
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Re: Origin of link router pattern

Reply to: https://jamesg.blog/2025/07/19/link-router-pattern

I think the origin, or at least the popularity, of services such as LinkTree lies in the artificial limitations on links in social networks. Instagram does not allow links in posts, and you can only have a single link in your profile. This makes it hard to promote a song or to link to multiple other channels.

Other platforms, such as LinkedIn, are suspected of limiting the reach of posts that include several links. Therefore, people use an indirection through a router to keep the number of links to a minimum.



Implementing my own Time Series Database: Core Structure (Part 1)

I'm currently working a lot with time series data and databases. That sparked an interest in looking under the hood and understanding database systems more deeply. Fueled by "how hard can this actually be" ignorance, I've started to implement my own time series database. Because learning one new thing wasn't enough, I also learned Zig in parallel.

I started the journey by binge-watching the CMU Intro to Database Systems lectures. Andy Pavlo is an amazing lecturer, and I can only recommend this series to anyone wanting to understand DB systems or feeling nostalgic about their time at university.

My goal was to build a DB that could:

  • store time series identified by an ID
  • query and transform these series using a simple query language
  • be accessible over the network

Core Structure

I started by figuring out how to work with pages. This eventually resulted in the following overall architecture for the core of my database.

A page directory serves as the central component handling pages in memory. Any other part of the system will retrieve and interact with pages through this directory. The directory itself is only responsible for pages in memory. Whenever it needs to persist or read pages from disk, it will use the File Manager.

The file manager is the single structure interfacing with the file system. Its only responsibility is to persist pages to disk and retrieve them again later on. Because of the file layout I've chosen, it knows some details about the database implementation, but it could be refactored to be completely agnostic of the database using it.

The main user of the page directory is the B+Tree implementation. Series in the database system are represented by B+Trees. Other systems will instantiate B+Tree instances for each series they interact with. The tree structure loads necessary data from the series pages via the page directory.
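To make the relationships concrete, here is a rough sketch of the three components. All names are illustrative stand-ins rather than the actual API; only the dependency direction matters:

const std = @import("std");

pub const FileManager = struct {
    // the only component that touches the file system;
    // holds the directory containing one file per series
    data_dir: std.fs.Dir,
};

pub const PageDirectory = struct {
    // used whenever a page has to be persisted to or read from disk
    file_manager: *FileManager,
};

pub const BPlusTree = struct {
    // all page accesses go through the page directory
    directory: *PageDirectory,
    // the series this tree instance represents (ID width is assumed)
    series_id: u32,
};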

Pages

Pages are the core entities of database systems. As databases have to deal with (much) more data than fits into memory, data has to be read from and written to disk frequently. Therefore, all persistent data is structured into pages. In essence, pages are just data blocks of a specified size that can be moved between disk and memory without transformation.

In my system, the start of each page is used as a page header. The header specifies what kind of data is stored in the page. Additionally, it has a version number for future use and a CRC checksum to detect data corruption. The body of a page is just a block of data whose structure depends on its type. The interpretation of the body is controlled by the user of the page.

As pages are transferred between disk and memory, it is important to have exact control over their layout. Regular structs have no guarantees on the layout of fields in memory and may contain additional padding bytes between fields for alignment. Therefore, I used packed structs for all page definitions.

The size of a page is controlled by a comptime variable PAGE_SIZE. Thanks to Zig's comptime feature, all related sizes and offsets can be derived from it at compile time. I can also derive the capacity of the B+Tree branches and leaves (see below) from it. Therefore, I can simply adjust the page size, and all other parts will adjust accordingly.

/// total size of pages in bytes
pub const PAGE_SIZE: usize = 4 * 1024;
/// size of page header in bytes
pub const PAGE_HEADER_SIZE: usize = @bitSizeOf(PageHeader) / 8;
/// size of page data section in bytes
pub const PAGE_DATA_SIZE: usize = PAGE_SIZE - PAGE_HEADER_SIZE;

One limitation I ran into is that Zig does not (yet) allow arrays in packed structs. Because of this limitation, I had to define an integer type of the correct size to specify the body of the page, instead of simply using [PAGE_DATA_SIZE]u8. Another alternative would have been @Vector(PAGE_DATA_SIZE, u8) (as found in the issue), but this felt even more wrong.

pub const PageHeader = packed struct {
    type: PAGE_TYPE,
    version: u8,
    crc: u32 = 0,
};

pub const PageData = @Type(.{ 
	.int = .{
		.signedness = .unsigned,
		.bits = PAGE_DATA_SIZE * 8
	}
});

pub const Page = packed struct {
    header: PageHeader,
    // the body is treated as an opaque blob; how it is interpreted
    // depends on the page type
    data: PageData,
};

File Structure

After implementing my page structure, I had to think about how to organize pages on disk. I've decided to store each series created in the database as an individual file. Pages within a file/series are identified by their position within the file. Thereby, it is easy to get any page for a particular series via the file name. An additional hope was that this would keep related data close together in the filesystem, improving performance, but I never tested this hypothesis.

This structure means that pages have two IDs: a local and a global one. The local ID is the page's position within a series file. The global ID is the concatenation of the series ID and the local page ID. The first page (local ID 0) of each file is a series header page, which keeps information about the series. At the moment, it only stores the ID of the root page of the B+Tree.
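A minimal sketch of this ID scheme, reusing PAGE_SIZE from above; the integer widths and field names are my assumptions, not the actual implementation:

pub const SeriesId = u32;
/// position of a page within its series file
pub const LocalPageId = u32;

/// global ID: concatenation of series ID and local page ID
pub const GlobalPageId = packed struct {
    local: LocalPageId,
    series: SeriesId,
};

/// byte offset of a page inside its series file
pub fn fileOffset(local: LocalPageId) u64 {
    return @as(u64, local) * @as(u64, PAGE_SIZE);
}

With this layout, the series header page (local ID 0) always sits at byte offset 0 of its file.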

Page Directory

The page directory is responsible for holding and managing pages in memory. On initialization, it allocates a fixed number of slots which can hold pages. Thereby, no memory has to be allocated when a new page is loaded from disk, and the memory footprint of the directory is static. For each slot, a PageHandle is created. These handles track the usage of the slot: which page is currently loaded, whether it has been modified, and whether it is currently in use.
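Sketched out, such a handle could track fields like these; the names are my guesses based on the description above, not the actual definition:

pub const PageHandle = struct {
    /// the in-memory copy of the page
    page: Page,
    /// which page currently occupies the slot, if any
    id: ?GlobalPageId = null,
    /// the page was modified and must be written back eventually
    dirty: bool = false,
    /// set on every access, cleared again by the eviction sweep
    used: bool = false,
    /// number of callers currently using the page
    pins: u32 = 0,
};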

The interface for the consuming systems is rather simple. Pages can be requested by their ID, either in read or write mode. The page directory transparently loads them from disk through the file manager and provides the corresponding page handle to the caller. Once the page is no longer needed, the page handle is "returned" to the directory.

Eventually, during the runtime of the system, enough pages will have been loaded that all slots are occupied. At this point, a currently unused page has to be evicted. Each page handle has a used flag, which is set whenever anyone uses the page. When the directory looks for a free slot, it iterates over the slots, skipping any slot with a raised used flag and unsetting the flag as it passes. Once it encounters a slot with an unset flag that is not currently locked, the page in it is replaced. This approach (known as the clock or second-chance algorithm) approximates "least recently used" but is much faster than a full LRU implementation.
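A minimal sketch of that sweep, written against the PageHandle fields assumed above (in the real code, the clock position would live on the directory):

fn findVictim(slots: []PageHandle, clock_hand: *usize) ?*PageHandle {
    // two full sweeps suffice: the first clears all used flags,
    // the second finds any slot that was not touched in between
    var steps: usize = 0;
    while (steps < 2 * slots.len) : (steps += 1) {
        const handle = &slots[clock_hand.*];
        clock_hand.* = (clock_hand.* + 1) % slots.len;
        if (handle.pins > 0) continue; // slot is locked by a caller, skip it
        if (handle.used) {
            handle.used = false; // give the page a second chance
            continue;
        }
        return handle; // neither recently used nor locked: evict this one
    }
    return null; // every slot is currently pinned
}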

Modified pages are not directly saved to disk when they are freed by a writing user. Instead, they are kept dirty in the directory and only written to disk when the page has to be evicted. To prevent data loss, I've added a background worker which cycles through all slots and persists dirty pages if they are currently not in use. (A lesson learnt after losing a few weeks of data which was never persisted.)
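Sketched, the worker is a simple loop. Here, persistToDisk is a hypothetical stand-in for the file manager call, and the real code additionally needs to synchronize with the directory:

fn flushWorker(slots: []PageHandle, shutdown: *std.atomic.Value(bool)) void {
    while (!shutdown.load(.acquire)) {
        for (slots) |*handle| {
            // persist dirty pages that nobody is currently using
            if (handle.dirty and handle.pins == 0) {
                // persistToDisk is hypothetical, standing in for the
                // file manager write described above
                persistToDisk(handle) catch continue;
                handle.dirty = false;
            }
        }
        std.Thread.sleep(1 * std.time.ns_per_s); // flush interval (assumed: 1s)
    }
}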

B+Tree

I chose to represent series as B+Trees, a.k.a. the best data structure in computer science. In contrast to other time series databases, which often use LSM trees, I chose a B+Tree because it keeps the data ordered at every point in time. This makes traversal of the series data simple and (hopefully) fast.

B+Trees consist of branch and leaf nodes. Branch nodes keep a list of references to their children. Between two child references, the branch keeps a threshold value. The child referenced between two threshold values will only hold values between those thresholds. The rules by which the tree is transformed when values are added or removed keep it balanced, meaning all leaf nodes are at the same depth. This allows lookup of any value within a few steps through the tree.

In my database, the leaf nodes store (timestamp, value) tuples in order. Additionally, they hold a reference to their next sibling. This allows fast traversal of the leaf nodes when scanning over a time period.
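Tying this back to the page layout from earlier, the node capacities fall out of the page size at compile time. The entry layout and reference width here are assumptions, not the actual definitions:

/// one (timestamp, value) tuple as stored in a leaf
pub const Entry = packed struct {
    timestamp: i64,
    value: f64,
};

/// tuples per leaf, leaving room for the next-sibling reference
pub const LEAF_CAPACITY: usize =
    (PAGE_DATA_SIZE - @sizeOf(LocalPageId)) / @sizeOf(Entry);

/// children per branch: n child references interleaved with
/// n - 1 threshold timestamps must fit into the page body
pub const BRANCH_CAPACITY: usize =
    (PAGE_DATA_SIZE + @sizeOf(i64)) / (@sizeOf(LocalPageId) + @sizeOf(i64));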

Next Part: Query Language

#db #database #tsdb #dragondb #zig #dev