I recently added hashtag support to owl-blogs. The initial reason for this was to make posts more discoverable via ActivityPub, but I also found it helpful for further categorizing posts.
The implementation was quite simple. I use the markdown renderer goldmark, which has a plugin for hashtags.
As tags are also part of microformats2, I wanted to mark the hashtags accordingly.
This is currently not possible with the hashtag extension.
I've extended this to allow adding arbitrary attributes to the link tag (Related Pull Request).
Until this is merged into the main repository I'll use my own version, which can be done by adding a replace directive to the go.mod file.
Thesis arguing that #AGI will be reached in the next decade. Have not yet read the full text.
Problems I see with the argumentation so far:
The Data Wall
The data wall is a (hard) open problem.
Larger models either will need massively more data or be orders of magnitude more efficient with the given data.
More data is not available at that scale.
The latest models are already trained on virtually the entire internet.
They argue that a strategy similar to AlphaGo, where the model created its own training data by playing against itself, could be deployed for #GenAI.
I find this implausible, as generating intelligent text that is beneficial in training already requires a more capable AI.
Similarly, being more efficient with data is still an open problem, but I don't know enough about this to evaluate how likely it is to happen.
Additionally, going forward, any new datasets will be poisoned by AI output, as these models are already used on a massive scale to create "content". Research suggests that training on such data degrades the performance of models.
Large Scale Computing
Even if the data wall is broken, scaling the models will need massive computing power, consuming ever more resources and energy.
This will only be tolerated by the general public as long as this is overall beneficial to them.
There are already industries (mainly creative ones) being massively disrupted by the current generation of GenAI.
Philosophy Tube, in Here's What Ethical AI Really Means, shows how strikes and collective action can be tools to prevent the development of more powerful AI systems.
A nice and simple solution to add comments to your blog.
I've decided to go the slightly more complicated route and added #ActivityPub support to my blog directly. Any interaction will show up below the posts. However, this requires a backend and will not work with a static site generator.
Thumbnails use the same file format as their parent files, assuming the user already chose the best format for their images. The thumbnails are created with a width of 620px, equal to the content width of the blog's main body. If the image is already small enough, the image data is simply copied.
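The sizing rule can be sketched roughly like this. It is only an illustration, not the code used in owl-blogs: the helper name `thumbnailSize` is made up, and keeping the aspect ratio (with rounding to the nearest pixel) is my assumption.

```go
package main

import "fmt"

// maxThumbnailWidth is the content width of the blog's main body in pixels.
const maxThumbnailWidth = 620

// thumbnailSize is a hypothetical helper. It returns the dimensions a
// thumbnail would get and whether the image has to be resized at all.
// Assumption: the aspect ratio of the original image is preserved.
func thumbnailSize(width, height int) (w, h int, resize bool) {
	if width <= maxThumbnailWidth {
		// Small enough already: the image data can be copied unchanged.
		return width, height, false
	}
	// Scale the height by the same factor as the width, rounding to nearest.
	h = (height*maxThumbnailWidth + width/2) / width
	if h < 1 {
		h = 1
	}
	return maxThumbnailWidth, h, true
}

func main() {
	fmt.Println(thumbnailSize(1240, 800)) // prints: 620 400 true
	fmt.Println(thumbnailSize(400, 300))  // prints: 400 300 false
}
```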
The URL of a thumbnail can simply be generated by replacing /media/ with /thumbnail/.
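As a small sketch of that mapping: the helper name `thumbnailURL` and the example path are made up for illustration, and replacing only the first occurrence of /media/ is my assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// thumbnailURL is a hypothetical helper that turns the URL of a media file
// into the URL of its thumbnail. Only the first "/media/" is replaced, so a
// file name that happens to contain "/media/" again is left alone.
func thumbnailURL(mediaURL string) string {
	return strings.Replace(mediaURL, "/media/", "/thumbnail/", 1)
}

func main() {
	fmt.Println(thumbnailURL("/media/example.png")) // prints: /thumbnail/example.png
}
```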
I will still write some tests and see if any errors occur on my blog before merging this feature into the main branch.
My blog software (owl-blogs) uses a single SQLite database to store everything, including all uploaded files. I'm aware that storing large files in a relational database isn't best practice. It started out as a placeholder implementation, but I liked the idea of having a single file I can back up.
An often-stated reason against storing binary blobs in relational databases is read performance, but I didn't find any benchmarks supporting this claim. Therefore I built a small test setup to see the difference between serving binary files out of a SQLite database and serving them from the file system directly.
As my blog is written in Go, I created a simple server similar to my blog. It uses sqlx and go-sqlite3 for the database handling and net/http for the static file server:
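package main

import (
	"database/sql"
	"errors"
	"log"
	"net/http"

	"github.com/jmoiron/sqlx"
	_ "github.com/mattn/go-sqlite3"
)

// sqlBinaryFile holds the blob column of a single row of the files table.
type sqlBinaryFile struct {
	Data []byte `db:"data"`
}

// sqlHandler serves files stored as blobs in the SQLite database.
type sqlHandler struct {
	Db *sqlx.DB
}

func (h *sqlHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	id := r.PathValue("filename")
	var sqlFile sqlBinaryFile
	err := h.Db.Get(&sqlFile, "SELECT data FROM files WHERE id = ?", id)
	if errors.Is(err, sql.ErrNoRows) {
		http.NotFound(w, r)
		return
	}
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Write(sqlFile.Data)
}

func main() {
	db := sqlx.MustOpen("sqlite3", "files.db")
	sqlFiles := &sqlHandler{Db: db}
	// Serve the same files from disk under /dir/ and from SQLite under /sqlite/.
	fs := http.StripPrefix("/dir/", http.FileServer(http.Dir("./static")))
	http.Handle("/dir/", fs)
	http.Handle("/sqlite/{filename}", sqlFiles)
	log.Print("Listening on :3000...")
	err := http.ListenAndServe(":3000", nil)
	if err != nil {
		log.Fatal(err)
	}
}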
As a test set I created 2000 files between 200 kB and 4 MB in size using a simple Python script:
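import os
import random

os.makedirs("static", exist_ok=True)  # the shell redirect cannot create the directory
for i in range(2000):
    # Each file gets between 200 and 4000 KiB of random bytes.
    os.system(f"head -c {random.randint(200, 4000)}K </dev/urandom > static/{i:05d}.bin")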
The SQLite database was created with this script:
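import os
import sqlite3

# Start from a fresh database; skip the removal if none exists yet.
if os.path.exists("files.db"):
    os.remove("files.db")

con = sqlite3.connect("files.db")
cur = con.cursor()
cur.execute("CREATE TABLE files( id VARCHAR(255) PRIMARY KEY, data BLOB NOT NULL )")
for f in os.listdir("static"):
    print(f)
    # The file name doubles as the primary key the server looks up.
    with open("static/" + f, "rb") as fh:
        data = fh.read()
    cur.execute("INSERT INTO files(id, data) VALUES (?, ?)", (f, data))
con.commit()
con.close()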
To benchmark the server I created two files listing all file URLs (one for SQLite, one for the filesystem) and used siege to run the benchmark with this configuration:
siege -f urls_sqlite.txt -c 1 -b --time=10s -j
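The URL lists passed to siege via -f can be generated with a few lines of code. The following is a sketch rather than the script I actually used: the file name urls_fs.txt, the host localhost:3000, and the helper `urlList` are assumptions, while the paths follow the routes of the server above and the file names follow the generator script.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"strings"
)

// urlList builds one URL per line for the files 00000.bin, 00001.bin, and so
// on, matching the names produced by the file generator script.
func urlList(base string, count int) string {
	var b strings.Builder
	for i := 0; i < count; i++ {
		fmt.Fprintf(&b, "%s%05d.bin\n", base, i)
	}
	return b.String()
}

func main() {
	// Map of output file to URL prefix. The host and the name urls_fs.txt
	// are assumptions made for this sketch.
	lists := map[string]string{
		"urls_sqlite.txt": "http://localhost:3000/sqlite/",
		"urls_fs.txt":     "http://localhost:3000/dir/",
	}
	for name, base := range lists {
		if err := os.WriteFile(name, []byte(urlList(base, 2000)), 0o644); err != nil {
			log.Fatal(err)
		}
	}
}
```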
The test was executed on my laptop:
CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
CPU max MHz: 4600,0000
Memory: 16 GB
I ran the test with different concurrency and plotted the results:
For a low throughput system (such as my blog) the difference between SQLite and the filesystem is small enough to not care about.
The possible throughput (transactions/second) of the filesystem is ~2.3 times higher.
The response time grows slower with increased concurrency.
For the time being I will stick with my SQLite solution.
Once my blog gets really popular I can easily change the implementation of the binary repository.
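To illustrate why such a switch is cheap, here is a rough sketch of a repository abstraction. It is not the actual owl-blogs code: the interface `BinaryRepository`, both implementations, and `ErrNotFound` are hypothetical names, and the in-memory map only stands in for a database-backed store.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// BinaryRepository is a hypothetical interface. Callers only depend on it,
// not on where the bytes are actually stored.
type BinaryRepository interface {
	Get(id string) ([]byte, error)
}

// ErrNotFound is returned by both implementations for unknown ids.
var ErrNotFound = errors.New("binary file not found")

// memRepository stands in for a database-backed implementation.
type memRepository struct {
	files map[string][]byte
}

func (r memRepository) Get(id string) ([]byte, error) {
	data, ok := r.files[id]
	if !ok {
		return nil, ErrNotFound
	}
	return data, nil
}

// dirRepository reads files from a directory on disk.
type dirRepository struct {
	dir string
}

func (r dirRepository) Get(id string) ([]byte, error) {
	// filepath.Base drops any directory part so ids cannot escape r.dir.
	data, err := os.ReadFile(filepath.Join(r.dir, filepath.Base(id)))
	if errors.Is(err, os.ErrNotExist) {
		return nil, ErrNotFound
	}
	return data, err
}

func main() {
	dir, err := os.MkdirTemp("", "binrepo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, "a.bin"), []byte("disk"), 0o644); err != nil {
		panic(err)
	}

	// The calling code is identical for both storage backends.
	repos := []BinaryRepository{
		memRepository{files: map[string][]byte{"a.bin": []byte("memory")}},
		dirRepository{dir: dir},
	}
	for _, repo := range repos {
		data, err := repo.Get("a.bin")
		fmt.Println(string(data), err)
	}
}
```

With a design like this, moving from one backend to the other only changes which implementation is constructed at startup.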