Libove Blog

Personal Blog about anything - mostly programming, cooking and random thoughts

Scraping Mastodon peer lists

Over the weekend I decided to explore the Fediverse a bit, especially Mastodon. As the network is decentralized, my first step was to create a list of "instances" (servers running Mastodon). Luckily, Mastodon has an API endpoint that returns the list of peers an instance knows about. Using this API, I was able to start with a single instance and find over 89503 Mastodon servers (only ~24000 of these actually worked and exposed their peers).

For my first steps I used requests. This was too slow because many servers did not respond and each request had to wait for the previous one, so I switched to aiohttp to run many requests concurrently.
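As a rough illustration of that first, blocking approach: Mastodon's peers endpoint is `GET /api/v1/instance/peers`, which returns a JSON array of domain names. The sketch below is not my original script. It uses `urllib` from the standard library instead of requests to stay dependency-free, and the function names and the 10 second timeout are assumptions made for the example.

```python
import json
import urllib.request


def peers_url(domain: str) -> str:
    """Build the URL of the peers endpoint for one instance."""
    return f"https://{domain}/api/v1/instance/peers"


def parse_peers(body: bytes) -> list[str]:
    """Parse the response body, expected to be a JSON array of domain names."""
    data = json.loads(body)
    # Some servers answer with an error object instead of a list.
    if not isinstance(data, list):
        return []
    return [item for item in data if isinstance(item, str)]


def fetch_peers(domain: str, timeout: float = 10.0) -> list[str]:
    """Blocking request for a single instance (assumed timeout: 10 s)."""
    with urllib.request.urlopen(peers_url(domain), timeout=timeout) as resp:
        return parse_peers(resp.read())
```

Called in a plain loop, `fetch_peers` handles one server at a time, so every unresponsive server costs the full timeout. That is the bottleneck that made concurrency necessary.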

I used a main loop which started new request tasks and waited for these tasks to finish. Whenever a request task finished, I wrote the result to an SQLite database for later analysis and started another request task. This achieved good throughput and crawled the "entire" Mastodon world in a few hours.
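The loop described above could look roughly like the following. This is a reconstruction under assumptions, not the actual script: the table layout, the `max_tasks` limit, and the `(domain, peers, success)` return shape are made up for the example, and `fake_fetch` is a hypothetical stand-in for the real HTTP request so the sketch runs without network access.

```python
import asyncio
import sqlite3


async def crawl(seed, fetch, db, max_tasks=100):
    """Crawl outward from `seed` with at most `max_tasks` requests in flight.

    `fetch(domain)` is a coroutine function returning (domain, peers, success).
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS instances "
        "(domain TEXT PRIMARY KEY, success INTEGER, peer_count INTEGER)"
    )
    queue = [seed]   # domains waiting to be requested
    seen = {seed}    # domains already queued or requested
    tasks = set()    # request tasks currently in flight

    while queue or tasks:
        # Top up the set of running requests.
        while queue and len(tasks) < max_tasks:
            tasks.add(asyncio.create_task(fetch(queue.pop())))

        # Wait until at least one request finishes; keep the rest as pending.
        done, tasks = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)

        for task in done:
            domain, peers, success = task.result()
            # All SQLite access happens here, in the main loop.
            db.execute(
                "INSERT OR REPLACE INTO instances VALUES (?, ?, ?)",
                (domain, int(success), len(peers)),
            )
            for peer in peers:
                if peer not in seen:
                    seen.add(peer)
                    queue.append(peer)
        db.commit()
    return len(seen)


async def fake_fetch(domain):
    """Hypothetical stand-in for the HTTP request, backed by a tiny graph."""
    graph = {"a.example": ["b.example", "c.example"], "b.example": ["a.example"]}
    await asyncio.sleep(0)
    if domain in graph:
        return domain, graph[domain], True
    return domain, [], False
```

Running `asyncio.run(crawl("a.example", fake_fetch, sqlite3.connect(":memory:")))` visits all three example domains. Swapping `fake_fetch` for a real request function is the only change needed to crawl actual instances.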

I might add a cleaned up version of the script in the future.

Things I learned:

  • This was my first time seriously working with asyncio and aiohttp in Python.
  • asyncio.wait returns two sets of tasks, done and pending. I used this to process the done requests and afterwards replaced my tasks list with the pending return value.
  • SQLite does not work well with asyncio. This is why I stored the results in the main loop.
  • I found it easiest to catch all errors in my request function and return a tuple with a success indicator (return result, success) instead of allowing errors to be raised in the async task.
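To make the last point concrete, here is a minimal sketch of a request function that never raises and always returns `(result, success)`. The names `request_peers`, `do_request` and `flaky` are invented for this example. In the real crawler the awaited call would be the aiohttp request, which is replaced here by a hypothetical stand-in so the snippet runs offline.

```python
import asyncio


async def request_peers(domain, do_request, timeout=10.0):
    """Run one request and report failure as data: returns (result, success)."""
    try:
        result = await asyncio.wait_for(do_request(domain), timeout)
        return result, True
    except Exception as exc:
        # Timeouts, connection errors, bad JSON and so on all end up here.
        # Task cancellation is a BaseException and still propagates.
        return repr(exc), False


async def flaky(domain):
    """Hypothetical stand-in for the HTTP call; fails for 'bad' domains."""
    if domain.startswith("bad"):
        raise ConnectionError(f"cannot reach {domain}")
    return ["peer.example"]


async def main():
    domains = ["good.example", "bad.example"]
    tasks = {asyncio.create_task(request_peers(d, flaky)) for d in domains}
    results = []
    while tasks:
        # done holds the finished tasks, pending replaces the task set.
        done, tasks = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        # result() does not re-raise request errors; they were already caught.
        results.extend(task.result() for task in done)
    return results
```

With this shape, the main loop only has to check the success flag and never needs a `try`/`except` around `task.result()`.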


Building a Spring Graph Layout Algorithm in Rust