SQL-04 13: Query Optimisation

4. Query Optimisation

4.1. The Query Planner

All relational databases employ a query planner to digest the raw SQL statement and turn it into executable steps prior to executing the query. But the planner isn’t perfect, and it can optimize some SQL statements better than others. The query planner parses SQL into execution steps and decides which indexes, if any, and which search strategies it should use. It bases its plans on various heuristics and on its knowledge of the data distribution. It knows something that you often don’t know: how your data is distributed at any point in time. It won’t, however, relieve you of having to write efficient queries.

4.2. Planner Statistics

The query planner relies on having up-to-date statistics about the table, these statistics are updated when the autovacuum process runs, you can also run vacuum analyse manually if you have just made a lot of changes to a table.

You can run this query (or one like it) if you are interested in finding the statistics that the query planner is using to decide which type of search to use:

SELECT attname AS colname,
n_distinct,
array_to_string(most_common_vals, E'\n') AS common_vals,
array_to_string(most_common_freqs, E'\n') AS dist_freq
FROM pg_stats
WHERE schemaname = 'crime' ANT tablename = 'street_crime' AND attname in ('crime_type', 'location')
ORDER BY colname;

This will give you a result like this:

I limited my query to just two column names but you could look at statistics for all of them if you like. Here I can see in the n_distinct column the exact number of categories if it is greater than 1 or, if it is less than 0, the percentage that are unique. The dist_freq column tells us how many of the rows in the table have that matching value (common_vals) . The planner will use this information to decide if it would be quicker sequentially searching through a table than using an index. If most values in the column are the same, and that common value is filtered for in a WHERE, the planner would generally opt for a sequential scan to retrieve the rows. It can also guess whether a nested loop is more efficient than a hash by looking at the WHERE and JOIN conditions of a query and estimating the number of results from each table. Unlike some other databases the PostgreSQL team prefer to work at improving the query planner rather than allowing the user to add “hints” to the query to influence the planner. So if you find you have a very slow query it is always worth asking on GIS Stackexchange, or similar site, to see if you are missing something or the query planner needs to be made “smarter”.

4.3. Common Table Expressions

If you use a CTE in your query it can have benefits for the query performance or slow it down depending on how big and complex the CTE is. PostgreSQL will normally materialise your CTE as a temporary table. If this is small enough it can be stored in memory and will speed up your query. However, if it is a huge table, then it may end up being written to disk and be slower to read back than an embedded sub-query. Also, when a CTE is treated as a single unit, the planner can not make use of any indexes that were set up on the original table nor can it take advantage of knowing which columns are used in the remainder of the query. A CTE is a black box and the planner can not see out of it or into it so, if you run in to performance issues when using CTEs, then you may get a speed boost by rewriting them using sub-queries instead.

4.4. Using EXPLAIN to Explore Performance Issues

Regina Obe in PostGIS in Action suggests:

“There are a few items you should look for when troubleshooting query performance:

  • What indexes, if any, are being used?

  • In what order are the indexes being applied?

  • What is the order of function evaluation?

  • What strategies are used: nested loop, hash join, merge join, bitmap, sequential scan?

  • What is the difference between calculated and actual costs?

  • How many rows are scanned?

  • Is the planner using parallel workers?

The EXPLAIN (and particularly the EXPLAIN ANALYSE) query gives you the information you need to find to answer those questions. You can run these queries by adding EXPLAIN or EXPLAIN ANALYSE to the beginning of the query you are investigating.

You can get either a graphical view of the query (in pgAdmin or DBeaver) or just study the text output by running in psql - if the text is too plain for you there is an online analyser that you can paste the output into.

A text output can look like this:

A graphical display of the same information (click the “graph” button or press shift-F7 in pgadmin) looks like:

Explore query performance

  1. Drop any indexes on general.parish and crime.street_crime (just so we can see the improvement in a moment)

  2. Get the query plan for the following query:
    SELECT p.name FROM general.parish p , crime.street_crime c
    WHERE st_intersects(p.wkb_geometry, c.wkb_geometry)
    AND crime_type ILIKE 'burglary';

  3. Now add a spatial index back to each of the tables, and get the new query plan, how much has the speed of the query improved? Does adding a index on crime.street_crime.crime_type help?