This is just a lightweight note on how to improve the performance of a query that checks whether at least one record is present.
Recently a novice developer asked me to speed up a query. The query looked like this:
SELECT A.PRD_ID, CASE WHEN (SELECT COUNT(*)
FROM PRODUCT_IMAGE B
WHERE DELETE_YN = 'N'
AND B.PRD_ID = A.PRD_ID) > 0
THEN 'Y'
ELSE 'N'
END AS EXIST_IMAGE_YN
FROM PRODUCT A
WHERE CASE WHEN (SELECT COUNT(*)
FROM PRODUCT_IMAGE B
WHERE DELETE_YN = 'N'
AND B.PRD_ID = A.PRD_ID) > 0
THEN 'Y'
ELSE 'N'
END = :var;
The following is the ERD of the two tables in the query.
The developer told me that we could assign the value 'Y' or 'N' to the :var variable. When we assign 'Y' to :var, we get the PRD_IDs which have image files. When we assign 'N' to :var, we get the PRD_IDs which don't have image files.
As a proactive worker, the first question I asked was: how often is this query executed, and how important is it to be as efficient as possible? I didn't want to waste a couple of hours of coding and testing to save just three seconds every 24 hours. Some context was needed before charging into solution mode.
The next question I asked was: what's wrong with this code? If it retrieves the result set you want, aren't you writing your query right?
My questions sounded sarcastic, but I was serious. The developer said that he was writing the query to meet new business requirements and was not sure how often it would be run in the production environment. He just wanted to write an optimal query in the early stage of development, and he was not happy with this one because the same subquery showed up in both the SELECT list and the WHERE clause. He thought this repeated subquery was the pain point that killed performance, and he was on the brink of tears because his boss blamed him for not writing a performant query. So I advised him to listen to some music for a moment and relax.
Since I was not busy, I decided to dive into improving performance. In this note I will present a query pattern that is useful for checking whether a record is present.
Here is some code to create the test data set:
CREATE TABLE PRODUCT (
PRD_ID INT NOT NULL,
PRD_NM VARCHAR(100),
PRD_PRICE NUMERIC,
PRD_DESC VARCHAR(1000),
CONSTRAINT pk_PRODUCT PRIMARY KEY (PRD_ID)
);
CREATE TABLE PRODUCT_IMAGE (
PRD_ID INT NOT NULL,
IMG_SEQ INT NOT NULL,
IMG_FILE_NM VARCHAR(100),
IMG_FILE_PATH VARCHAR(100),
DELETE_YN VARCHAR(1),
CONSTRAINT PK_PRODUCT_IMAGE PRIMARY KEY (PRD_ID, IMG_SEQ)
);
INSERT INTO PRODUCT
SELECT i, 'ProductName'||i::text, mod(i,4000), repeat('z',100)
FROM generate_series(1,2000000) a(i);
INSERT INTO PRODUCT_IMAGE
SELECT A.PRD_ID, b.i, 'ImageFileName'||A.PRD_ID::text||b.i::text, repeat('path',2), 'N'
FROM PRODUCT A, generate_series(1,3) b(i)
WHERE MOD(A.PRD_ID,4) <> 3;
SELECT pg_relation_size('PRODUCT'), pg_relation_size('PRODUCT_IMAGE');
analyze product;
analyze product_image;
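Before looking at any plans, it helps to check the arithmetic behind the generated data set: images are created only for products with MOD(PRD_ID, 4) <> 3, three per product, so we should expect 1,500,000 products with images (4,500,000 image rows) and 500,000 products without any. A short Python sketch (illustration only; it does not touch the database) confirms this:

```python
# Reproduce the arithmetic behind the INSERT statements above.
# Images are generated only for products with PRD_ID % 4 != 3,
# three images per product (generate_series(1, 3)).
n_products = 2_000_000

products_with_images = sum(1 for i in range(1, n_products + 1) if i % 4 != 3)
products_without_images = n_products - products_with_images
image_rows = products_with_images * 3

print(products_with_images)     # 1500000 products that have images
print(products_without_images)  # 500000 products with no images
print(image_rows)               # 4500000 rows in PRODUCT_IMAGE
```

These counts line up with the row counts we will see in the execution plans.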
I have disabled parallel processing to keep the explanation simple and clear.
SET MAX_PARALLEL_WORKERS_PER_GATHER = 0;
I have run the following query, allocating 'Y' or 'N' to the :var variable.
SELECT A.PRD_ID, CASE WHEN (SELECT COUNT(*)
FROM PRODUCT_IMAGE B
WHERE DELETE_YN = 'N'
AND B.PRD_ID = A.PRD_ID) > 0
THEN 'Y'
ELSE 'N'
END AS EXIST_IMAGE_YN
FROM PRODUCT A
WHERE CASE WHEN (SELECT COUNT(*)
FROM PRODUCT_IMAGE B
WHERE DELETE_YN = 'N'
AND B.PRD_ID = A.PRD_ID) > 0
THEN 'Y'
ELSE 'N'
END = :var;
When I assigned 'Y' to :var, I got the following execution plan:
Seq Scan on product a (cost=0.00..32225464.07 rows=10001 width=36) (actual time=0.057..7409.173 rows=1500000 loops=1)
Filter: (CASE WHEN ((SubPlan 2) > 0) THEN 'Y'::text ELSE 'N'::text END = 'Y'::text)
Rows Removed by Filter: 500000
Buffers: shared hit=19500352 read=39287
SubPlan 1
-> Aggregate (cost=15.99..16.00 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=1500000)
Buffers: shared hit=9000000
-> Index Scan
using pk_product_image on product_image b (cost=0.43..15.98 rows=3
width=0) (actual time=0.001..0.002 rows=3 loops=1500000)
Index Cond: (prd_id = a.prd_id)
Filter: ((delete_yn)::text = 'N'::text)
Buffers: shared hit=9000000
SubPlan 2
-> Aggregate (cost=15.99..16.00 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=2000000)
Buffers: shared hit=10500000
-> Index Scan
using pk_product_image on product_image b_1 (cost=0.43..15.98 rows=3
width=0) (actual time=0.001..0.001 rows=2 loops=2000000)
Index Cond: (prd_id = a.prd_id)
Filter: ((delete_yn)::text = 'N'::text)
Buffers: shared hit=10500000
Planning Time: 0.092 ms
Execution Time: 7473.390 ms
Can you spot the inefficiency in the plan? SubPlans are a bit like nested loops: they can be called many times. For every row returned by the "Seq Scan on product a" node, PostgreSQL has to run SubPlan 2, which aggregates the intermediate result set of the "Index Scan" node to compute the count. Note "loops=2000000" in the Aggregate line under SubPlan 2, which tells us that SubPlan 2 was called 2 million times. Likewise, "loops=1500000" in the Aggregate line below SubPlan 1 tells us that SubPlan 1 was called 1.5 million times.
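A quick back-of-the-envelope calculation from the per-call times reported in the plan shows that the SubPlans alone account for most of the runtime (a rough sketch; the per-loop times EXPLAIN ANALYZE prints are averages with limited precision):

```python
# Rough cost arithmetic from the actual numbers in the plan above.
subplan1_calls = 1_500_000   # "loops=" on the Aggregate under SubPlan 1
subplan2_calls = 2_000_000   # "loops=" on the Aggregate under SubPlan 2
ms_per_call = 0.002          # "actual time=0.002..0.002" per Aggregate call

estimated_ms = (subplan1_calls + subplan2_calls) * ms_per_call
print(estimated_ms)  # 7000.0 -- in the same ballpark as the 7473 ms total
```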
When I assigned 'N' to :var, I got the following execution plan:
Seq Scan on product a (cost=0.00..32225464.07 rows=10001 width=36) (actual time=0.082..4601.768 rows=500000 loops=1)
Filter: (CASE WHEN ((SubPlan 2) > 0) THEN 'Y'::text ELSE 'N'::text END = 'N'::text)
Rows Removed by Filter: 1500000
Buffers: shared hit=12000320 read=39319
SubPlan 1
-> Aggregate (cost=15.99..16.00 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=500000)
Buffers: shared hit=1500000
-> Index Scan
using pk_product_image on product_image b (cost=0.43..15.98 rows=3
width=0) (actual time=0.001..0.001 rows=0 loops=500000)
Index Cond: (prd_id = a.prd_id)
Filter: ((delete_yn)::text = 'N'::text)
Buffers: shared hit=1500000
SubPlan 2
-> Aggregate (cost=15.99..16.00 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=2000000)
Buffers: shared hit=10500000
-> Index Scan
using pk_product_image on product_image b_1 (cost=0.43..15.98 rows=3
width=0) (actual time=0.001..0.001
rows=2 loops=2000000)
Index Cond: (prd_id = a.prd_id)
Filter: ((delete_yn)::text = 'N'::text)
Buffers: shared hit=10500000
Planning Time: 0.109 ms
Execution Time: 4623.509 ms
Note that the query accessed the pk_product_image index 2,000,000 times for the WHERE clause and another 500,000 times for the SELECT list. That huge amount of index access (2,500,000 probes in total) looks like a good culprit for the problem we are trying to solve.
When we take a closer look at the subquery, it is counting the rows in the PRODUCT_IMAGE table merely to check whether any rows are present. Counting rows is more expensive than checking for the existence of a row: counting means that PostgreSQL has to visit every row that matches the criteria, whereas an existence check can stop as soon as the first matching row is found. The EXISTS clause is therefore the natural tool for checking the existence of a row. So, assuming we've decided that some form of check-for-existence strategy is desirable (you may well find a better strategy than this), here is one of the smarter solutions.
SELECT A.PRD_ID
, CASE WHEN :var='Y' THEN 'Y'
WHEN :var='N' THEN 'N'
END AS EXIST_IMAGE_YN
FROM PRODUCT A
WHERE CASE WHEN (SELECT 1
WHERE EXISTS (SELECT 1
FROM PRODUCT_IMAGE B
WHERE DELETE_YN = 'N'
AND B.PRD_ID = A.PRD_ID)
) > 0
THEN 'Y'
ELSE 'N'
END = :var;
The first point to note is that the above query is logically equivalent to the original one. I eliminated the subquery in the SELECT list and made the optimizer check for the existence of any rows in the PRODUCT_IMAGE table instead of counting the rows that satisfy the conditions.
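The equivalence is easy to convince yourself of on a toy data set. The sketch below uses SQLite through Python's sqlite3 module (made-up miniature tables, purely to illustrate the logic; PostgreSQL behaves the same way for these queries):

```python
import sqlite3

# Toy tables: product 1 has live images, product 2 has none,
# product 3 has only a deleted image.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (prd_id INTEGER PRIMARY KEY);
CREATE TABLE product_image (prd_id INTEGER, img_seq INTEGER, delete_yn TEXT);
INSERT INTO product VALUES (1), (2), (3);
INSERT INTO product_image VALUES (1, 1, 'N'), (1, 2, 'N'), (3, 1, 'Y');
""")

# COUNT(*)-based flag, as in the original query.
count_based = con.execute("""
SELECT prd_id,
       CASE WHEN (SELECT COUNT(*) FROM product_image b
                  WHERE b.delete_yn = 'N' AND b.prd_id = a.prd_id) > 0
            THEN 'Y' ELSE 'N' END
FROM product a ORDER BY prd_id
""").fetchall()

# EXISTS-based flag, as in the rewrite.
exists_based = con.execute("""
SELECT prd_id,
       CASE WHEN EXISTS (SELECT 1 FROM product_image b
                         WHERE b.delete_yn = 'N' AND b.prd_id = a.prd_id)
            THEN 'Y' ELSE 'N' END
FROM product a ORDER BY prd_id
""").fetchall()

assert count_based == exists_based == [(1, 'Y'), (2, 'N'), (3, 'N')]
```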
Here's the execution plan obtained using the EXPLAIN (ANALYZE, BUFFERS) command.
Seq Scan on product a (cost=0.00..11319985.99 rows=10001 width=36) (actual time=0.054..2909.378 rows=1500000 loops=1)
Filter: (CASE WHEN ((SubPlan 2) > 0) THEN 'Y'::text ELSE 'N'::text END = 'Y'::text)
Rows Removed by Filter: 500000
Buffers: shared hit=7500256 read=39383
SubPlan 2
-> Result (cost=5.61..5.62 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=2000000)
One-Time Filter: $1
Buffers: shared hit=7500000
InitPlan 1 (returns $1)
-> Index Scan using pk_product_image on
product_image b (cost=0.43..15.98 rows=3 width=0) (actual
time=0.001..0.001
rows=1 loops=2000000)
Index Cond: (prd_id = a.prd_id)
Filter: ((delete_yn)::text = 'N'::text)
Buffers: shared hit=7500000
Planning Time: 0.122 ms
Execution Time: 2968.715 ms
The first thing to note is that the Aggregate node is gone. However, the execution plan is still nagging us with SubPlan 2. A "Result" node is added when a query selects constant values (SELECT 1 in the above subquery). Note "loops=2000000" in the "Result" node, which indicates that SubPlan 2 was called 2 million times. You might think that accessing the pk_product_image index 2,000,000 times is outrageous in terms of query performance. Before reading on, think for a minute about how we can improve performance.
Here is a more cunning option: I tried to induce the optimizer to use a hash semi join or a hash anti join.
SELECT A.PRD_ID, CASE WHEN :VAL ='Y' THEN 'Y' ELSE 'N' END AS EXIST_IMAGE_YN
FROM PRODUCT A
WHERE (EXISTS (SELECT 1
FROM PRODUCT_IMAGE B
WHERE B.DELETE_YN = 'N'
AND A.PRD_ID = B.PRD_ID)
AND 'Y' = :VAL
)
OR (NOT EXISTS (SELECT 1
FROM PRODUCT_IMAGE B
WHERE B.DELETE_YN = 'N'
AND A.PRD_ID = B.PRD_ID)
AND 'N' = :VAL
)
;
When we allocate 'Y' to the :VAL variable, we get the following plan.
Seq Scan on product a (cost=0.00..11289982.41 rows=1000120 width=36) (actual time=0.070..2295.409 rows=1500000 loops=1)
Filter: (SubPlan 1)
Rows Removed by Filter: 500000
Buffers: shared hit=7500929 read=38710
SubPlan 1
-> Index Scan using pk_product_image on product_image b
(cost=0.43..15.98 rows=3 width=0) (actual time=0.001..0.001 rows=1
loops=2000000)
Index Cond: (prd_id = a.prd_id)
Filter: ((delete_yn)::text = 'N'::text)
Buffers: shared hit=7500000
Planning Time: 0.116 ms
Execution Time: 2350.457 ms
Note that, compared with the original query, the number of block IOs decreased from (19,500,352 + 39,287) to (7,500,929 + 38,710) and the elapsed time dropped from 7,473 ms to 2,350 ms. However, we failed to prompt the optimizer to use a hash semi join. The thing is that OR conditions in database engines like Oracle or PostgreSQL tend to have a negative impact on performance. Sometimes they are harmless, but I have had queries go from taking minutes to lightning fast once the OR conditions were removed. One possible solution is to change the OR into a UNION or even a UNION ALL. So I rewrote the query as follows:
SELECT A.PRD_ID, 'Y' AS EXIST_IMAGE_YN
FROM PRODUCT A
WHERE EXISTS (SELECT 1
FROM PRODUCT_IMAGE B
WHERE B.DELETE_YN = 'N'
AND A.PRD_ID = B.PRD_ID)
AND 'Y' = :VAL
UNION ALL
SELECT A.PRD_ID, 'N' AS EXIST_IMAGE_YN
FROM PRODUCT A
WHERE NOT EXISTS (SELECT 1
FROM PRODUCT_IMAGE B
WHERE B.DELETE_YN = 'N'
AND A.PRD_ID = B.PRD_ID)
AND 'N' = :VAL;
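To double-check that the UNION ALL version is interchangeable with the original, here is a small sqlite3-based sketch comparing the two for both parameter values (toy tables, illustration only; the parameter is written :val because that is how sqlite3 named parameters are bound from Python):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (prd_id INTEGER PRIMARY KEY);
CREATE TABLE product_image (prd_id INTEGER, img_seq INTEGER, delete_yn TEXT);
INSERT INTO product VALUES (1), (2), (3), (4);
INSERT INTO product_image VALUES (1, 1, 'N'), (2, 1, 'Y'), (4, 1, 'N');
""")

# The original COUNT(*)-in-CASE query.
original = """
SELECT a.prd_id,
       CASE WHEN (SELECT COUNT(*) FROM product_image b
                  WHERE b.delete_yn = 'N' AND b.prd_id = a.prd_id) > 0
            THEN 'Y' ELSE 'N' END AS exist_image_yn
FROM product a
WHERE CASE WHEN (SELECT COUNT(*) FROM product_image b
                 WHERE b.delete_yn = 'N' AND b.prd_id = a.prd_id) > 0
           THEN 'Y' ELSE 'N' END = :val
ORDER BY prd_id
"""

# The EXISTS / NOT EXISTS branches glued with UNION ALL.
rewritten = """
SELECT a.prd_id, 'Y' AS exist_image_yn
FROM product a
WHERE EXISTS (SELECT 1 FROM product_image b
              WHERE b.delete_yn = 'N' AND a.prd_id = b.prd_id)
  AND 'Y' = :val
UNION ALL
SELECT a.prd_id, 'N' AS exist_image_yn
FROM product a
WHERE NOT EXISTS (SELECT 1 FROM product_image b
                  WHERE b.delete_yn = 'N' AND a.prd_id = b.prd_id)
  AND 'N' = :val
ORDER BY prd_id
"""

for val in ('Y', 'N'):
    assert (con.execute(original, {"val": val}).fetchall()
            == con.execute(rewritten, {"val": val}).fetchall())
```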
When you allocate 'Y' to the :VAL variable, you get the following plan.
Hash Semi Join (cost=169865.65..285050.94 rows=1380628 width=36) (actual time=939.736..1810.502 rows=1500000 loops=1)
Hash Cond: (a.prd_id = b.prd_id)
Buffers: shared hit=40516 read=38902, temp read=18482 written=18482
-> Seq Scan on product a (cost=0.00..59641.39 rows=2000239 width=4) (actual time=0.012..195.315 rows=2000000 loops=1)
Buffers: shared hit=737 read=38902
-> Hash (cost=96032.33..96032.33 rows=4500266 width=4) (actual time=938.828..938.828 rows=4500000 loops=1)
Buckets: 262144 Batches: 32 Memory Usage: 6974kB
Buffers: shared hit=39779, temp written=12759
-> Seq Scan on product_image b (cost=0.00..96032.33
rows=4500266 width=4) (actual time=0.022..482.027 rows=4500000 loops=1)
Filter: ((delete_yn)::text = 'N'::text)
Buffers: shared hit=39779
Planning:
Buffers: shared hit=16
Planning Time: 0.423 ms
Execution Time: 1853.102 ms
When you allocate 'N' to the :VAL variable, you get the following plan.
Hash Anti Join (cost=169865.65..277440.77 rows=619611 width=36) (actual time=961.812..1784.279 rows=500000 loops=1)
Hash Cond: (a.prd_id = b.prd_id)
Buffers: shared hit=40484 read=38934, temp read=18482 written=18482
-> Seq Scan on product a (cost=0.00..59641.39 rows=2000239 width=4) (actual time=0.010..200.522 rows=2000000 loops=1)
Buffers: shared hit=705 read=38934
-> Hash (cost=96032.33..96032.33 rows=4500266 width=4) (actual time=960.595..960.595 rows=4500000 loops=1)
Buckets: 262144 Batches: 32 Memory Usage: 6974kB
Buffers: shared hit=39779, temp written=12759
-> Seq Scan on product_image b (cost=0.00..96032.33
rows=4500266 width=4) (actual time=0.091..495.088 rows=4500000 loops=1)
Filter: ((delete_yn)::text = 'N'::text)
Buffers: shared hit=39779
Planning:
Buffers: shared hit=16
Planning Time: 0.278 ms
Execution Time: 1799.066 ms
As we can see, we succeeded in getting rid of the nagging SubPlans. Note how the estimated costs in the plans change as we tweak the query. With the revised query, both the number of block IOs and the elapsed time dropped.
This check-for-existence strategy with a UNION ALL is efficient on three counts: first, PostgreSQL runs only the upper branch or the lower branch, depending on the :VAL value assigned; second, PostgreSQL uses a hash semi join or a hash anti join, which lets the optimizer avoid repetitive access to the index; third, PostgreSQL dispenses with the aggregate operations.
Conclusion
When you see a query that counts rows, examine whether you can change the SQL statement into a check for the existence of a row.