After scraping "all" Mastodon instances, I wanted to visualize the graph of instances.
My expectation is that this is a (quite dense) social graph.
To bring order to such a graph, a force-directed graph model can be used.
I previously used the networkx implementation for this.
However, the peers graph I currently have is too large, with 24007 nodes and over 80 million edges.
When trying to import this into networkx I simply ran out of memory (16 GB).
After asking for advice on Mastodon, I tried out Gephi, which also ran out of memory before loading the entire graph.
Because I still want to visualize this graph, I've decided to write my own graph layout program.
For the first implementation I followed the slides of a lecture at KIT.
This gave me some janky but promising results, as I was able to load my graph and an iteration only required ~10 seconds.
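To give an idea of what such an iteration does, here is a minimal, vectorized sketch of a generic force-directed step (not my actual implementation; the function name and force constants are made up, and the O(n²) repulsion is only practical for small graphs):

```python
import numpy as np

def layout_step(pos, edges, repulsion=0.1, attraction=0.01, step=0.1):
    """One iteration of a basic force-directed layout.

    pos:   (n, 2) array of node positions
    edges: (m, 2) integer array of node index pairs
    """
    # repulsion between every pair of nodes (O(n^2) memory, fine for small graphs)
    delta = pos[:, None, :] - pos[None, :, :]          # (n, n, 2)
    dist = np.linalg.norm(delta, axis=-1) + 1e-9       # avoid division by zero
    repulse = repulsion * (delta / dist[..., None] ** 2).sum(axis=1)

    # attraction along edges pulls connected nodes together
    diff = pos[edges[:, 0]] - pos[edges[:, 1]]
    attract = np.zeros_like(pos)
    np.add.at(attract, edges[:, 0], -attraction * diff)
    np.add.at(attract, edges[:, 1], attraction * diff)

    return pos + step * (repulse + attract)
```

Starting from random positions and repeating this step many times pulls connected nodes together while pushing everything else apart.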
To validate my implementation I created a debug graph, consisting of 2000 nodes with 4 clusters of different sizes.
After this first implementation I took pen and paper and thought about the problem a bit.
This led to an improved version, with a simpler model leading to faster execution times and quicker convergence.
Embedding the Mastodon instances graph is still challenging.
The algorithm creates oscillations in the graph, which I suspect are introduced by one (or multiple) large cliques.
I will post an update soon.
Over the weekend I've decided to explore the Fediverse a bit, especially Mastodon.
As the network is decentralized, my first step was to create a list of "instances" (servers running Mastodon).
Luckily, Mastodon has an API endpoint from which you can get the list of peers of an instance.
Using this API I was able to start with a single instance and find over 89503 Mastodon servers (only ~24000 of these also worked and exposed their peers).
For my first steps I used requests.
As this was too slow, with many servers not responding, I switched to aiohttp to run multiple concurrent requests.
I used a main loop which started new request tasks and waited for these tasks to finish.
Whenever a request task finished, I wrote the result to an SQLite database for later analysis and started another request task.
This achieved a good throughput and crawled the "entire" mastodon world in a few hours.
I might add a cleaned up version of the script in the future.
Things I learned:
This was my first time seriously working with asyncio and aiohttp in Python.
asyncio.wait returns a done and a pending set. I used this to process finished requests and afterwards replaced my task list with the returned pending set.
SQLite does not work well with asyncio, which is why I stored the results in the main loop.
I found it easiest to catch all errors in my request function and return a tuple with a success indicator (return result, success) instead of letting errors be raised in the async task (see the sketch below).
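To make the pattern more concrete, here is a simplified sketch (not the actual crawler; fetch_peers, MAX_CONCURRENT, the seed instance and the peers.db schema are made up for illustration):

```python
import asyncio
import sqlite3

import aiohttp

MAX_CONCURRENT = 100  # made-up limit on parallel requests


async def fetch_peers(session, host):
    """Fetch the peer list of a single instance.

    All errors are caught here, so the task always returns (host, peers, success).
    """
    try:
        url = f"https://{host}/api/v1/instance/peers"
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return host, await resp.json(), True
    except Exception:
        return host, [], False


async def crawl(seed):
    db = sqlite3.connect("peers.db")
    db.execute("CREATE TABLE IF NOT EXISTS peers (host TEXT, peer TEXT)")
    todo, seen = {seed}, {seed}
    tasks = set()

    async with aiohttp.ClientSession() as session:
        while todo or tasks:
            # keep a fixed number of request tasks running
            while todo and len(tasks) < MAX_CONCURRENT:
                tasks.add(asyncio.create_task(fetch_peers(session, todo.pop())))

            # process finished tasks, keep the pending ones for the next round
            done, tasks = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                host, peers, success = task.result()
                if not success or not isinstance(peers, list):
                    continue
                # write results in the main loop, SQLite is not async-friendly
                db.executemany("INSERT INTO peers VALUES (?, ?)", [(host, p) for p in peers])
                db.commit()
                for peer in peers:
                    if peer not in seen:
                        seen.add(peer)
                        todo.add(peer)
    db.close()


asyncio.run(crawl("mastodon.social"))  # any instance works as a seed
```

Replacing the task set with the pending set returned by asyncio.wait keeps a bounded number of requests in flight, while all SQLite writes happen synchronously in the main loop.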
I just finished implementing IndieAuth, which allows me to log in to sites and tools that support it, using my blog.
The main reason I've implemented this is because I want to use Micropub to write and publish.
At the moment, creating a new entry is a bit cumbersome.
It's ok for longer posts, but it prevents me from doing things like bookmarks or check-ins.
If you want to implement IndieAuth yourself I recommend using the living standard document directly.
I started using the wiki page as a reference and later realised that it was outdated (the outdated content has been removed in the meantime).
IndieAuth is basically OAuth2 with a few additions and conventions. If you are already familiar with OAuth2, you will probably not have many problems implementing it.
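To make those additions a bit more concrete, here is a rough sketch of building an IndieAuth authorization request (illustrative only, not taken from my implementation; all URLs are placeholders): the user and the client are both identified by URLs, the authorization endpoint is discovered from the user's site, and the rest is ordinary OAuth2 with PKCE.

```python
import base64
import hashlib
import secrets
import urllib.parse

# the user is identified by their site, the client by its URL (placeholders)
me = "https://example.org/"
client_id = "https://app.example/"
# discovered from the user's homepage via a rel="authorization_endpoint" link
auth_endpoint = "https://example.org/auth"

# standard PKCE code challenge
verifier = secrets.token_urlsafe(32)
challenge = (
    base64.urlsafe_b64encode(hashlib.sha256(verifier.encode()).digest())
    .rstrip(b"=")
    .decode()
)

params = {
    "response_type": "code",
    "client_id": client_id,
    "redirect_uri": client_id + "callback",
    "state": secrets.token_urlsafe(16),
    "code_challenge": challenge,
    "code_challenge_method": "S256",
    "me": me,  # hint telling the endpoint who wants to log in
}
print(auth_endpoint + "?" + urllib.parse.urlencode(params))
```

After the user approves, the authorization endpoint redirects back with a code, which the client redeems (together with the PKCE verifier) to receive the canonical "me" URL.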
My (first) implementation can be found in this pull request.
Before I continue with Micropub, I have to do some refactoring as the code is still a bit rough around the edges.
The books are set in a future version of the solar system where virtually everything has been turned into smart matter and the boundaries between "reality" and the virtual world are blurred to the point where they almost no longer matter. The series tells the story of "Jean le Flambeur" and "Mieli", who travel the solar system to fulfil their duties to a Goddess. The journey takes them to Mars, Earth and Saturn.
The yaml package in Go uses a default time format for marshalling and will only accept this format when reading a yaml file.
This is ok if you use the serializer to exchange data between programs, but it quickly falls apart when user-generated files have to be parsed.
As I'm using yaml headers in Markdown files (and yaml files in various other places) to generate this blog, I need a more robust and user-friendly way to set publishing dates.
Ideally the blog software should accept any time format.
For this guide I will assume that we have a yaml format like this:
title: "How to parse YAML dates"
date: 17-10-2022
The corresponding definition in Go looks like this:
type Data struct {
    Title string    `yaml:"title"`
    Date  time.Time `yaml:"date"`
}
When we try to unmarshal the above yaml it will result in an error: parsing time "17-10-2022" as "2006-01-02T15:04:05Z07:00": cannot parse "0-2022" as "2006".
To support this format, the unmarshal behavior of the Data struct has to be overridden.
func (d *Data) UnmarshalYAML(unmarshal func(interface{}) error) error {
    type T struct {
        Title string `yaml:"title"`
    }
    type S struct {
        Date string `yaml:"date"`
    }
    // parse non-time fields
    var t T
    if err := unmarshal(&t); err != nil {
        return err
    }
    d.Title = t.Title
    // parse time field
    var s S
    if err := unmarshal(&s); err != nil {
        return err
    }
    possibleFormats := []string{
        "02-01-2006",
        time.Layout,
        time.ANSIC,
    }
    foundTime := false
    for _, format := range possibleFormats {
        if t, err := time.Parse(format, s.Date); err == nil {
            d.Date = t
            foundTime = true
            break
        }
    }
    if !foundTime {
        return fmt.Errorf("could not parse date: %s", s.Date)
    }
    return nil
}
In order to control the unmarshalling of a yaml file, we have to provide an UnmarshalYAML method for the struct.
This method is passed an unmarshal function which can be used to parse the yaml into arbitrary structs.
We will use this function multiple times to retrieve different parts of the yaml file.
type T struct {
    Title string `yaml:"title"`
}
type S struct {
    Date string `yaml:"date"`
}
Next we separate our Data struct into two parts: one for everything we just want to keep as it is, and one for the date field.
Note that the date field uses string as its type, as we want to do the time parsing ourselves.
// parse non-time fields
var t T
if err := unmarshal(&t); err != nil {
    return err
}
d.Title = t.Title
For all fields that we want to keep untouched, we just have to call the unmarshal function and, if no error occurred, copy the data to our data struct.
To parse the date we start similarly, by getting the time string.
// parse time field
var s S
if err := unmarshal(&s); err != nil {
    return err
}
Afterwards, the function loops over all formats that we want to support and checks whether the string matches any of them.
The list of possible formats can be extended.
I kept it short in this example.
You can find the full list I use at the bottom of the tutorial.
possibleFormats := []string{
    "02-01-2006",
    time.Layout,
    time.ANSIC,
}
foundTime := false
for _, format := range possibleFormats {
    if t, err := time.Parse(format, s.Date); err == nil {
        d.Date = t
        foundTime = true
        break
    }
}
Finally, we check if any format matched.
if !foundTime {
    return fmt.Errorf("could not parse date: %s", s.Date)
}
return nil
Now our Data struct will accept any time format we add to the possibleFormats list.
This is the full list I currently use in my blog software; the latest code can be found in the owl-blogs project.
I've worked on my microformat markup and added more author information to the h-entry.
Additionally, I've added support for replies to my Markdown files by defining the reply-to in the metadata like so: