PostgreSQL
Optimizing a Database Query with Functional Indexes in Postgres
 
A great way to mitigate slow queries is to add indexes to your lookup columns
This greatly increases efficiency when there's a very large record count. Adding a column index is super easy to do, but what about a computed property? It turns out that the Postgres database supports adding functional indexes. In fact, I learned about them recently while fixing a bug on Today I Learned.
The Problem
Perhaps you're familiar with Today I Learned, TILEX/TIL for short. TIL is our open-source microblogging platform where we share little tidbits that we learn everyday. One of the features of TILEX is weekly Slack reporting for page views. Once a week, TILEX will run a job to summarize the previous weeks traffic and post the results to an internal Slack channel. It does this by way of our requests table, which at the time of this post has a few million rows.
We hadn't seen our weekly summary in a long time. So I started to dig around and I found a timeout error in AppSignal. Following the stacktrace led me to page_views_report.ex:9.
Here, you can see that we were executing the query we use to build up our page view report data.
(
    (
        select
            count(*),
            date_trunc('day', request_time at time zone 'america/chicago'),
            'day' as period
        from requests
        where request_time at time zone 'america/chicago'
            between date_trunc('week', now() at time zone 'america/chicago') - '2 weeks'::interval and date_trunc('week', now() at time zone 'america/chicago')
        group by date_trunc('day', request_time at time zone 'america/chicago')
        order by date_trunc desc
    )
    union
    (
        select
            count(*),
            date_trunc('week', request_time at time zone 'america/chicago'),
            'week' as period from requests
        where request_time at time zone 'america/chicago'
            between date_trunc('week', now() at time zone 'america/chicago') - '2 weeks'::interval and date_trunc('week', now() at time zone 'america/chicago')
        group by date_trunc('week', request_time at time zone 'america/chicago')
        order by date_trunc desc
    )
)
order by date_trunc desc
Running this locally, I was able to see that query took a long time to execute, sometimes around 15 seconds. As you can probably imagine, this execution time was way worse in production, and was the exact cause of the timeout error in AppSignal.
From a cursory glance at the query, it's reasonable to assume that we need an index on request time. After checking the table, I verified that we had an index on request_time:
tilex_dev=# \d requests
                         Table "public.requests"
    Column    |           Type           | Collation | Nullable | Default
--------------+--------------------------+-----------+----------+---------
 page         | text                     |           | not null |
 request_time | timestamp with time zone |           |          | now()
Indexes:
    "requests_page_index" btree (page)
    "requests_request_time_index" btree (request_time)
But using explain analyze on the query led to an interesting discovery: we weren't hitting our index on request_time.
                                                                                                                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=328290.04..328419.75 rows=51884 width=48) (actual time=10057.124..10057.124 rows=0 loops=1)
   Sort Key: (date_trunc('day'::text, timezone('america/chicago'::text, requests.request_time))) DESC
   Sort Method: quicksort  Memory: 25kB
   ->  HashAggregate  (cost=323707.90..324226.74 rows=51884 width=48) (actual time=10057.118..10057.118 rows=0 loops=1)
         Group Key: (count(*)), (date_trunc('day'::text, timezone('america/chicago'::text, requests.request_time))), ('day'::text)
         ->  Append  (cost=158034.58..323318.77 rows=51884 width=48) (actual time=10056.813..10056.813 rows=0 loops=1)
               ->  Finalize GroupAggregate  (cost=158034.58..161270.26 rows=25942 width=48) (actual time=5022.033..5022.033 rows=0 loops=1)
                     Group Key: (date_trunc('day'::text, timezone('america/chicago'::text, requests.request_time)))
                     ->  Gather Merge  (cost=158034.58..160773.04 rows=21618 width=16) (actual time=5022.031..5022.070 rows=0 loops=1)
                           Workers Planned: 2
                           Workers Launched: 2
                           ->  Partial GroupAggregate  (cost=157034.56..157277.76 rows=10809 width=16) (actual time=5011.334..5011.334 rows=0 loops=3)
                                 Group Key: (date_trunc('day'::text, timezone('america/chicago'::text, requests.request_time)))
                                 ->  Sort  (cost=157034.56..157061.58 rows=10809 width=8) (actual time=5011.330..5011.330 rows=0 loops=3)
                                       Sort Key: (date_trunc('day'::text, timezone('america/chicago'::text, requests.request_time))) DESC
                                       Sort Method: quicksort  Memory: 25kB
                                       Worker 0:  Sort Method: quicksort  Memory: 25kB
                                       Worker 1:  Sort Method: quicksort  Memory: 25kB
                                       ->  Parallel Seq Scan on requests  (cost=0.00..156310.36 rows=10809 width=8) (actual time=5011.228..5011.228 rows=0 loops=3)
                                             Filter: ((timezone('america/chicago'::text, request_time) <= date_trunc('week'::text, timezone('america/chicago'::text, now()))) AND (timezone('america/chicago'::text, request_time) >= (date_trunc('week'::text, timezone('america/chicago'::text, now())) - '14 days'::interval)))
                                             Rows Removed by Filter: 1729479
               ->  Finalize GroupAggregate  (cost=158034.58..161270.26 rows=25942 width=48) (actual time=5034.776..5034.776 rows=0 loops=1)
                     Group Key: (date_trunc('week'::text, timezone('america/chicago'::text, requests_1.request_time)))
                     ->  Gather Merge  (cost=158034.58..160773.04 rows=21618 width=16) (actual time=5034.773..5036.138 rows=0 loops=1)
                           Workers Planned: 2
                           Workers Launched: 2
                           ->  Partial GroupAggregate  (cost=157034.56..157277.76 rows=10809 width=16) (actual time=5030.673..5030.673 rows=0 loops=3)
                                 Group Key: (date_trunc('week'::text, timezone('america/chicago'::text, requests_1.request_time)))
                                 ->  Sort  (cost=157034.56..157061.58 rows=10809 width=8) (actual time=5030.670..5030.670 rows=0 loops=3)
                                       Sort Key: (date_trunc('week'::text, timezone('america/chicago'::text, requests_1.request_time))) DESC
                                       Sort Method: quicksort  Memory: 25kB
                                       Worker 0:  Sort Method: quicksort  Memory: 25kB
                                       Worker 1:  Sort Method: quicksort  Memory: 25kB
                                       ->  Parallel Seq Scan on requests requests_1  (cost=0.00..156310.36 rows=10809 width=8) (actual time=5030.598..5030.598 rows=0 loops=3)
                                             Filter: ((timezone('america/chicago'::text, request_time) <= date_trunc('week'::text, timezone('america/chicago'::text, now()))) AND (timezone('america/chicago'::text, request_time) >= (date_trunc('week'::text, timezone('america/chicago'::text, now())) - '14 days'::interval)))
                                             Rows Removed by Filter: 1729479
 Planning Time: 0.892 ms
 Execution Time: 10058.969 ms
(38 rows)
The part that I found interesting was both of the Parallel Seq Scan on requests. The cost for this part of the query looked high, relative to others, and it was not making use of the index on requests.
But of course, that's because the where clause of the query was usingrequest_time at 'america/chicago', which you can see in the query plan above.
The Solution
Luckily, Postgres has functional indexes, and this fit the bill perfectly - we needed to index the request_time column in the display time zone.
create index concurrently index_requests_on_request_time_in_chicago on requests (timezone('america/chicago', request_time))
After creating the index and re-running with explain analyze, we got a massive improvement locally. Going from over 10 seconds to less than 3 milliseconds. This exponential speed up carried over to production as well. After running some tests, we were seeing the query execute in under 1 second in a production environment.
With explain-analyze, we also see that our query is now using our index: (Below output is truncated for your eyes)
->  Bitmap Index Scan on index_requests_on_request_time_in_chicago  (cost=0.00..547.87 rows=25942 width=0) (actual time=0.045..0.045 rows=0 loops=1)
After deploying our fix, we're back to our regular weekly Slackbot notifications! Hooray!
You can find out more about functional indexes in the Postgres docs.
And if you're curious about this particular change to TIL, you can check out the since merged PR.
Until next time!
 
