Scraping Mastodon peer lists
Over the weekend I decided to explore the Fediverse a bit, especially Mastodon. As the network is decentralized, my first step was to create a list of "instances" (servers running Mastodon). Luckily, Mastodon has an API endpoint that returns the list of peers an instance knows about. Using this API I was able to start from a single instance and find over 89,503 Mastodon servers (only ~24,000 of these also worked and exposed their peers).
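The endpoint itself is simple. Something along these lines is enough to get one instance's peer list (a minimal sketch; `mastodon.social` is just an example host):

```python
import requests

def fetch_peers(host: str) -> list[str]:
    """Return the list of peer domains reported by a Mastodon instance."""
    # /api/v1/instance/peers returns a JSON array of domain names
    resp = requests.get(f"https://{host}/api/v1/instance/peers", timeout=10)
    resp.raise_for_status()
    return resp.json()

# e.g. fetch_peers("mastodon.social") -> ["example.org", ...]
```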
For my first steps I used requests. This turned out to be too slow, since many servers did not respond, so I switched to aiohttp to run multiple concurrent requests.
I used a main loop which started new request tasks and waited for them to finish. Whenever a request task finished, I wrote the result to an SQLite database for later analysis and started another request task. This achieved good throughput and crawled the "entire" Mastodon world in a few hours.
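The pattern looks roughly like this (a sketch, not the actual script; the concurrency limit, database schema, and seed host are my own assumptions here):

```python
import asyncio
import sqlite3
import aiohttp

CONCURRENCY = 100  # assumed limit, tune to taste

async def fetch_peers(session: aiohttp.ClientSession, host: str):
    """Fetch one instance's peer list; never raises, returns (host, peers, ok)."""
    try:
        url = f"https://{host}/api/v1/instance/peers"
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return host, await resp.json(), True
    except Exception:
        return host, [], False

async def crawl(seed: str, db_path: str = "peers.db") -> None:
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS instances (host TEXT PRIMARY KEY, ok INTEGER)")
    seen = {seed}
    queue = [seed]
    async with aiohttp.ClientSession() as session:
        tasks = set()
        while queue or tasks:
            # top up the running tasks from the queue of unvisited hosts
            while queue and len(tasks) < CONCURRENCY:
                tasks.add(asyncio.create_task(fetch_peers(session, queue.pop())))
            done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                host, peers, ok = task.result()
                # sqlite writes happen here in the main loop, not inside the tasks
                db.execute("INSERT OR REPLACE INTO instances VALUES (?, ?)", (host, int(ok)))
                for peer in peers or []:
                    if peer not in seen:
                        seen.add(peer)
                        queue.append(peer)
            db.commit()
            tasks = pending  # keep only the still-running tasks
    db.close()

if __name__ == "__main__":
    asyncio.run(crawl("mastodon.social"))
```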
I might add a cleaned-up version of the script in the future.
Things I learned:
- First time seriously working with `asyncio` and `aiohttp` in Python. `asyncio.wait` returns a `done` and a `pending` set. I used this to process the done requests and afterwards replaced my tasks list with the `pending` return value.
- SQLite does not work well with asyncio. This is why I stored the results in the main loop.
- I found it easiest to catch all errors in my request function and return a tuple with a success indicator (`return result, success`) instead of letting exceptions escape from the async task (see the sketch below).
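A minimal illustration of that last point (`fetch_json` and the URL are placeholders): the coroutine swallows exceptions and reports failure through its return value, so the caller only checks a flag instead of wrapping `task.result()` in try/except.

```python
import asyncio
import aiohttp

async def fetch_json(session: aiohttp.ClientSession, url: str):
    """Never raises: returns (data, True) on success or (None, False) on any error."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return await resp.json(), True
    except Exception:
        return None, False

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        result, success = await fetch_json(
            session, "https://mastodon.social/api/v1/instance/peers"
        )
        print(f"got {len(result)} peers" if success else "request failed")

asyncio.run(main())
```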