This past week I triggered an incident on one of the production systems. The issue was the result of an AWS Aurora Postgres major version upgrade (10 -> 11) gone wrong. Here is a brief summary of what I learnt from it.
Every running DBMS has some state associated with the data it stores. This state enables the DBMS to tune its own performance.
Upgrading the DB is tricky for sure. Always create contingency plans (backups, snapshots, redundant DBs) before executing destructive operations. Migrating the data from one DB to another is part of the work, and it's rather straightforward thanks to cloud providers. Verifying and validating data correctness post-migration is simple too.

However, the difficult part is understanding the DBMS state migration. Usually, this state is not migrated by default; DBAs need to trigger it manually post-migration.
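For Postgres, that manual step is small but easy to forget. A minimal sketch, assuming you can connect to the upgraded database with any SQL client:

```sql
-- Rebuild the planner statistics for every table in the current
-- database. VERBOSE prints per-table progress as it goes.
ANALYZE VERBOSE;
```

On larger databases, the `vacuumdb --all --analyze-in-stages` client tool does the same thing in multiple passes, producing minimally usable statistics quickly and refining them afterwards.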
The inconsistency between the pre- and post-migration states resulted in slower queries. Comparing the query plans before and after the migration can point out missing indexes and inefficient plans.
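Something like the following is what surfaced the problem for me. The table and query here are made up for illustration:

```sql
-- Run the query under EXPLAIN ANALYZE before and after the
-- migration: it executes the query and reports the chosen plan,
-- the planner's row estimates, and the actual timings.
EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE customer_id = 42;
```

With fresh statistics this kind of lookup typically gets an Index Scan; with stale or missing statistics the planner can misestimate row counts and fall back to a sequential scan.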
This is how I discovered that PostgreSQL's query planner uses the statistics collected in the `pg_statistic` catalog table to determine the most efficient execution plans.[^1] Moreover, PostgreSQL provides a special statement called `ANALYZE` to populate the statistics in this table.[^2]
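The human-readable view over that catalog is `pg_stats`, and peeking at it is a quick way to confirm whether statistics exist at all. A small sketch, using the hypothetical `orders` table again:

```sql
-- Refresh statistics for one table, restricted to one column.
ANALYZE orders (customer_id);

-- Inspect what the planner now knows: null fraction, number of
-- distinct values, and the most common values per column.
SELECT attname, null_frac, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'orders';
```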
Note: The `ANALYZE` statement is not in the SQL standard.
There are considerations to be made around the `ANALYZE` statement:

- `ANALYZE` requires only a read lock on the target table. This enables concurrent queries to run in parallel.
- In the default configuration, the `autovacuum` daemon automates the execution of the `ANALYZE` statement. Check the `autovacuum` configuration of your DBs (a sketch of how follows this list).
- A common strategy for read-mostly databases is to run `ANALYZE` once a day during a low-usage time of day.
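Both checks can be done from a SQL prompt. The views and setting names below are standard PostgreSQL, though Aurora may override some of the defaults:

```sql
-- Is the autovacuum daemon enabled, and how eagerly does it analyze?
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum',
               'autovacuum_analyze_threshold',
               'autovacuum_analyze_scale_factor');

-- When was each table last analyzed, manually or by the daemon?
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY last_autoanalyze NULLS FIRST;
```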
I’m still trying to figure out a good strategy to tune `ANALYZE` on the production DBs for individual use cases; the kind of per-table tuning I mean is sketched below.
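A sketch with made-up numbers, on the hypothetical `orders` table again; both settings are standard PostgreSQL:

```sql
-- Analyze this table more eagerly than the global default:
-- trigger an automatic ANALYZE after roughly 2% of rows change.
ALTER TABLE orders
  SET (autovacuum_analyze_scale_factor = 0.02);

-- Collect a larger sample and histogram for a heavily filtered
-- column, at the cost of a slower ANALYZE (the default target is 100).
ALTER TABLE orders
  ALTER COLUMN customer_id SET STATISTICS 500;
```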
But that’s a story for another time.
I have learnt this lesson about query planning the hard way. Now, there is wisdom, and some scar tissue.