A question on stackoverflow.com prompted me to write this note.
https://stackoverflow.com/questions/54856702/postgresql-11-goes-for-parallel-seq-scan-on-partitioned-table-where-index-should
The
most popular answer in the post concluded that "PostgreSQL punts when
it sees a GROUP BY clause." And it also supplied the source code in
src/backend/optimizer/plan/planagg.c to convince that there is not much
point in optimizing MIN/MAX with GROUP BY.
I
thought that I might be able to work out a solution to improve the
performance of the query with ease. But it took me about a week to find a
solution.
I'll explain how we can make the problematic query faster in this note.
One
of the wonderful features I think PostgreSQL has is its "min/max"
optimization. Rather than talking through it, here is a little
demonstration.
CREATE TABLE PART (
C1 INT,
C2 VARCHAR(100),
C3 VARCHAR(10),
STATUS VARCHAR(1)
)
PARTITION BY LIST(C1)
;
CREATE TABLE PART1 PARTITION OF PART FOR VALUES IN(1);
CREATE TABLE PART2 PARTITION OF PART FOR VALUES IN(2);
CREATE TABLE PART3 PARTITION OF PART FOR VALUES IN(3);
CREATE TABLE PART4 PARTITION OF PART FOR VALUES IN(4);
INSERT INTO PART
SELECT i%4+1
, i::text||'WhoeverDoesNotLoveDoesNotKnowGodBecauseGodIsLove'
, substr(md5(random()::text),1,10)
, chr(65+i%3)
FROM GENERATE_SERIES(1,1000000) a(i);
CREATE INDEX PART_X01 ON PART(STATUS, C3);
CREATE INDEX PART_X02 ON PART(C1, STATUS);
CREATE TABLE NONPART AS SELECT * FROM PART;
CREATE INDEX NONPART_X01 ON NONPART(STATUS, C3);
I have
created a non-partitioned table "NONPART" and a partitioned table
"PART". Let's say that I want to get a minimum STATUS value and maximum
STATUS value. Then my query would be:
SELECT MIN(STATUS), MAX(STATUS) FROM PART;
If
you run the query in ORACLE, it is inevitable to do an index fast full
scan which has to read all the index blocks. And it also has to do a
sort aggregate by STATUS. The ORACLE execution plan would be as follows:
OPERATION NAME
-----------------------------------------------
SELECT STATEMENT
SORT AGGREGATE
INDEX FAST FULL SCAN PART_X01 --> this results in a lot of block I/Os
In ORACLE if you want to improve the response time, you have to rewrite it as follows.
SELECT MIN(MIN_VAL), MAX(MAX_VAL)
FROM (
SELECT MIN(STATUS) as MIN_VAL, NULL as MAX_VAL
FROM PART
UNION ALL
SELECT NULL AS MIN_VAL, MAX(STATUS) as MAX_VAL
FROM PART
)
;
Then
how would PostgreSQL get the minimum and maximum values of STATUS?
Below is the execution plan the optimizer produces with some cosmetic
editing.
Result (actual time=0.137..0.138 rows=1 loops=1)
Buffers: shared hit=32
InitPlan 1 (returns $0)
-> Limit (actual time=0.096..0.096 rows=1 loops=1)
Buffers: shared hit=16
-> Result (actual time=0.095..0.095 rows=1 loops=1)
Buffers: shared hit=16
-> Merge Append (actual time=0.094..0.094 rows=1 loops=1)
Sort Key: part.status
Buffers: shared hit=16
-> Index Only Scan using part1_status_c3_idx on part1 part_1 (actual time=0.016..0.016 rows=1 loops=1)
Index Cond: (status IS NOT NULL)
Heap Fetches: 0
Buffers: shared hit=4
-> Index Only Scan using part2_status_c3_idx on part2 part_2 (actual time=0.008..0.008 rows=1 loops=1)
Index Cond: (status IS NOT NULL)
Heap Fetches: 0
Buffers: shared hit=4
-> Index Only Scan using part3_status_c3_idx on part3 part_3 (actual time=0.007..0.007 rows=1 loops=1)
Index Cond: (status IS NOT NULL)
Heap Fetches: 0
Buffers: shared hit=4
-> Index Only Scan using part4_status_c3_idx on part4 part_4 (actual time=0.061..0.061 rows=1 loops=1)
Index Cond: (status IS NOT NULL)
Heap Fetches: 0
Buffers: shared hit=4
InitPlan 2 (returns $1)
-> Limit (actual time=0.038..0.038 rows=1 loops=1)
Buffers: shared hit=16
-> Result (actual time=0.037..0.038 rows=1 loops=1)
Buffers: shared hit=16
-> Merge Append (actual time=0.037..0.037 rows=1 loops=1)
Sort Key: part_5.status DESC
Buffers: shared hit=16
-> Index Only Scan Backward using part1_status_c3_idx on part1 part_6 (actual time=0.009..0.009 rows=1 loops=1)
Index Cond: (status IS NOT NULL)
Heap Fetches: 0
Buffers: shared hit=4
-> Index Only Scan Backward using part2_status_c3_idx on part2 part_7 (actual time=0.007..0.007 rows=1 loops=1)
Index Cond: (status IS NOT NULL)
Heap Fetches: 0
Buffers: shared hit=4
-> Index Only Scan Backward using part3_status_c3_idx on part3 part_8 (actual time=0.016..0.016 rows=1 loops=1)
Index Cond: (status IS NOT NULL)
Heap Fetches: 0
Buffers: shared hit=4
-> Index Only Scan Backward using part4_status_c3_idx on part4 part_9 (actual time=0.005..0.005 rows=1 loops=1)
Index Cond: (status IS NOT NULL)
Heap Fetches: 0
Buffers: shared hit=4
Planning Time: 0.249 ms
Execution Time: 0.189 ms
In
the above plan we can not see any aggregate operations. PostgreSQL
doesn't need to acquire a lot of data and sort it when it knows that the
"left hand" end of an index is the low values and the "right hand" end
is the high values. Surely with the same amount of data the block I/O
and the execution time are much smaller than those of ORACLE. The bigger
the data, the more huge the performance gap is.
In
the InitPlan 1 of the execution plan the optimizer could simply walk
down the index branches to the left hand leaf and look at the single
lowest entry in the leaf block to determine the lowest value for STATUS.
(In a partitioned table, PostgreSQL supports only local indexes. So it
had to access four local indexes.) The Limit is specifically
designed to stop reading the object early when the desired row, the
max/min value in this example, was extracted.
In
the InitPlan 2 of the execution plan the optimizer could simply walk
down the index branches to the right hand leaf and look at the single
highest entry in the leaf block to determine the highest value for
STATUS. It hit three index blocks and one heap block in each partition.
In
the future if PostgreSQL implements global indexes in a partitioned
table, it will be able to get the result without visiting every
partition.
Now let's assume that we should extract min(c3) and max(c3) per STATUS. Then the query would be like this:
--From now on I use a non-partitioned table for simplicity of explanation.
SELECT STATUS, MIN(C3), MAX(C3)
FROM NONPART
GROUP BY STATUS;
Since
we have an index on the columns (STATUS, C3), we would expect the
optimizer produces an execution plan using the index. It would access
the left-hand end of the index and the right-hand end of the index.
Now let's take a look at the execution plan.
Finalize GroupAggregate (actual time=208.101..210.502 rows=3 loops=1)
Group Key: status
Buffers: shared hit=12362
-> Gather Merge (actual time=208.093..210.491 rows=9 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=12362
-> Sort (actual time=199.000..199.000 rows=3 loops=3)
Sort Key: status
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=12362
Worker 0: Sort Method: quicksort Memory: 25kB
Worker 1: Sort Method: quicksort Memory: 25kB
-> Partial HashAggregate (actual time=198.967..198.968 rows=3 loops=3)
Group Key: status
Batches: 1 Memory Usage: 24kB
Buffers: shared hit=12346
Worker 0: Batches: 1 Memory Usage: 24kB
Worker 1: Batches: 1 Memory Usage: 24kB
-> Parallel Seq Scan on nonpart (actual time=0.006..37.536 rows=333333 loops=3)
Buffers: shared hit=12346
Planning Time: 0.082 ms
Execution Time: 210.534 ms
Disappointingly
the optimizer didn't use the index. PostgreSQL is not smart enough to
figure out it can use the index to speed up min()/max() with GROUP BY. So how can we work around this problem?
STATUS
has small number of distinct values. PostgreSQL can extract distinct
values super efficiently when the column is indexed. So I re-engineered
the query as follows:
WITH RECURSIVE W AS (
(SELECT STATUS FROM NONPART ORDER BY STATUS LIMIT 1)
UNION ALL
SELECT (SELECT STATUS
FROM NONPART A
WHERE A.STATUS > W.STATUS
ORDER BY A.STATUS LIMIT 1)
FROM W
WHERE STATUS IS NOT NULL
)
SELECT W.STATUS, B.*
FROM W
JOIN LATERAL
(SELECT MIN(C3) AS MIN_C3, MAX(C3) AS MAX_C3
FROM NONPART B
WHERE W.STATUS = B.STATUS
) B ON TRUE
WHERE W.STATUS IS NOT NULL;
The key thing in the above query is that I got rid of the GROUP BY clause. According to the PostgreSQL source code "src/backend/optimizer/plan/planagg.c", you cannot get a performance boost with GROUP BY.
The
SQL statement looks a little convoluted with the inline select
statement after JOIN LATERAL. But it does very small amounts of work and
it is only reading the leftmost leaf and the rightmost leaf of the
index. We know from the above example that PostgreSQL can execute the
MIN/MAX in the inline view after JOIN LATERAL very efficiently; we can
hope (and check) that the inline view driven by the single values from
the recursive row in W will operate just as efficiently - and here is
the plan:
Nested Loop (actual time=0.134..0.411 rows=3 loops=1)
Buffers: shared hit=25 read=9
CTE w
-> Recursive Union (actual time=0.082..0.289 rows=4 loops=1)
Buffers: shared hit=11 read=3
-> Limit (actual time=0.081..0.081 rows=1 loops=1)
Buffers: shared hit=1 read=3
-> Index Only Scan using nonpart_x01 on nonpart (actual time=0.080..0.080 rows=1 loops=1)
Heap Fetches: 0
Buffers: shared hit=1 read=3
-> WorkTable Scan on w w_1 (actual time=0.051..0.051 rows=1 loops=4)
Filter: (status IS NOT NULL)
Rows Removed by Filter: 0
Buffers: shared hit=10
SubPlan 1
-> Limit (actual time=0.005..0.006 rows=1 loops=3)
Buffers: shared hit=10
-> Index Only Scan using nonpart_x01 on nonpart a (actual time=0.005..0.005 rows=1 loops=3)
Index Cond: (status > (w_1.status)::text)
Heap Fetches: 0
Buffers: shared hit=10
-> CTE Scan on w (actual time=0.084..0.292 rows=3 loops=1)
Filter: (status IS NOT NULL)
Rows Removed by Filter: 1
Buffers: shared hit=11 read=3
-> Result (actual time=0.038..0.038 rows=1 loops=3)
Buffers: shared hit=14 read=6
InitPlan 3 (returns $4)
-> Limit (actual time=0.009..0.009 rows=1 loops=3) --read the leftmost leaf block of the index
Buffers: shared hit=10
-> Index Only Scan using nonpart_x01 on nonpart b (actual time=0.009..0.009 rows=1 loops=3)
Index Cond: ((status = ($3)::text) AND (c3 IS NOT NULL))
Heap Fetches: 0
Buffers: shared hit=10
InitPlan 4 (returns $5)
-> Limit (actual time=0.028..0.028 rows=1 loops=3) --read the rightmost leaf block of the index
Buffers: shared hit=4 read=6
-> Index Only Scan Backward using nonpart_x01 on nonpart b_1 (actual time=0.027..0.027 rows=1 loops=3)
Index Cond: ((status = ($3)::text) AND (c3 IS NOT NULL))
Heap Fetches: 0
Buffers: shared hit=4 read=6
Planning Time: 0.164 ms
Execution Time: 0.479 ms
To understand the revised query, you need to have knowledge of the following:
1. PostgreSQL can get distinct values efficiently using the Common Table Expression.
2. Using a lateral join we can run everything after the lateral for each row before the lateral.
3. PostgreSQL is super excellent in MIN/MAX optimization on an indexed column. (This was explained in the above example.)
If
we run the above query against a partitioned table(PART) instead of a
non-partitioned table(NONPART), we will do lots of individual probes
into the local indexes. Still they will stay very efficient using a
MIN/MAX access.
Wrap Up
PostgreSQL looks at all the rows when it implements GROUP BY even when there is an index on columns.
When we want to improve MIN/MAX with GROUP BY, we should find a method to remove GROUP BY in the SQL statement.