Scraping Mastodon peer lists
Over the weekend I decided to explore the Fediverse a bit, especially Mastodon. As the network is decentralized, my first step was to create a list of "instances" (servers running Mastodon). Luckily, Mastodon has an API endpoint that returns the list of peers an instance knows about. Using this API I was able to start from a single instance and find over 89,503 Mastodon servers (only ~24,000 of these also worked and exposed their peers).
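The endpoint itself is simple. Something along these lines is enough to get one instance's peer list (a minimal sketch; `mastodon.social` is just an example host):

```python
import requests

def fetch_peers(host: str) -> list[str]:
    """Return the list of peer domains reported by a Mastodon instance."""
    # /api/v1/instance/peers returns a JSON array of domain names
    resp = requests.get(f"https://{host}/api/v1/instance/peers", timeout=10)
    resp.raise_for_status()
    return resp.json()

# e.g. fetch_peers("mastodon.social") -> ["example.org", ...]
```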
For my first steps I used requests. This turned out to be too slow, since many servers did not respond, so I switched to aiohttp to run multiple concurrent requests.
I used a main loop which started new request tasks and waited for them to finish. Whenever a request task finished, I wrote the result to an SQLite database for later analysis and started another request task. This achieved good throughput and crawled the "entire" Mastodon world in a few hours.
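The pattern looks roughly like this (a sketch, not the actual script; the concurrency limit, database schema, and seed host are my own assumptions here):

```python
import asyncio
import sqlite3
import aiohttp

CONCURRENCY = 100  # assumed limit, tune to taste

async def fetch_peers(session: aiohttp.ClientSession, host: str):
    """Fetch one instance's peer list; never raises, returns (host, peers, ok)."""
    try:
        url = f"https://{host}/api/v1/instance/peers"
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return host, await resp.json(), True
    except Exception:
        return host, [], False

async def crawl(seed: str, db_path: str = "peers.db") -> None:
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS instances (host TEXT PRIMARY KEY, ok INTEGER)")
    seen = {seed}
    queue = [seed]
    async with aiohttp.ClientSession() as session:
        tasks = set()
        while queue or tasks:
            # top up the running tasks from the queue of unvisited hosts
            while queue and len(tasks) < CONCURRENCY:
                tasks.add(asyncio.create_task(fetch_peers(session, queue.pop())))
            done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                host, peers, ok = task.result()
                # sqlite writes happen here in the main loop, not inside the tasks
                db.execute("INSERT OR REPLACE INTO instances VALUES (?, ?)", (host, int(ok)))
                for peer in peers or []:
                    if peer not in seen:
                        seen.add(peer)
                        queue.append(peer)
            db.commit()
            tasks = pending  # keep only the still-running tasks
    db.close()

if __name__ == "__main__":
    asyncio.run(crawl("mastodon.social"))
```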
I might add a cleaned-up version of the script in the future.
Things I learned:
- First time seriously working with `asyncio` and `aiohttp` in Python. `asyncio.wait` returns a `done` and a `pending` set. I used this to process the done requests and afterwards replaced my tasks list with the `pending` return value.
- SQLite does not work well with asyncio. This is why I stored the results in the main loop.
- I found it easiest to catch all errors in my request function and return a tuple with a success indicator (`return result, success`) instead of letting exceptions escape from the async task (see the sketch below).
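A minimal illustration of that last point (`fetch_json` and the URL are placeholders): the coroutine swallows exceptions and reports failure through its return value, so the caller only checks a flag instead of wrapping `task.result()` in try/except.

```python
import asyncio
import aiohttp

async def fetch_json(session: aiohttp.ClientSession, url: str):
    """Never raises: returns (data, True) on success or (None, False) on any error."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return await resp.json(), True
    except Exception:
        return None, False

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        result, success = await fetch_json(
            session, "https://mastodon.social/api/v1/instance/peers"
        )
        print(f"got {len(result)} peers" if success else "request failed")

asyncio.run(main())
```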