Libove Blog

Personal Blog about anything - mostly programming, cooking and random thoughts

Materialized Views in PostgreSQL

At my hobby project Podcast de facto Standard I ran into problems querying the required data to render the report pages. My first fix for this was to query and process all data (for all reports) once per day and to store the results as one big JSON object in an additional table. This worked well until the server was running out of memory because too much data had to be loaded from the database to be processed in python.

I generally try to avoid database specific feature for as long as possible in my projects. Ideally I'm able to use SQLite in local development and automated testing and only start using Postgres in the testing and production environment. This works great with abstraction layers such as SQLAlchemy and Django and allows me to postpone the decision for a specific database system for as long as possible. As a nice bonus it keeps the testing setup very simple as I can simple use an in memory SQLite.

The growing memory requirements were the point where I had to choose a specific database. As I use PostgreSQL virtually always in production systems I looked into the options it provided to optimize my queries.

The main problems where cause by the episode data, a ~9 GB table with +11 million rows. I already had indexes in place but the queries for the audio properties of podcasts still took +1 minute per plot. This meant that "live" queries where no longer possible and I needed a solution to store (partial) results.

I've chosen to use materialized views as they are a great fit for my problems. The data presented on the page doesn't have to be updated with every newly analyzed audio file. It is sufficient to get an update once per day (even once per week would be fine). The transformations required can easily be done in SQL.

Materialized views behave almost like a regular table, but you cannot directly create or update rows in the table. Instead the content is created from an SQL statement, executed once on creation or upon refreshing the view.

Creating a materialized view is simple:

create materialized view pub_hist 
select date(publishing_date), count(*)
from episode e
group by date(publishing_date)

This command executes the selection query and stores the result in the view pub_hist. Afterwards you can simply query from this view:

select *
from pub_hist
where "date" < now()

Other than normal views the content of the materialized view is not updated automatically. Instead you have to trigger a refresh, which will reevaluate the defined query and store the result:

REFRESH MATERIALIZED VIEW pub_hist;

To refresh the views I simply run a daily (celery) job, which calls the refresh command for all views.