설문조사
PostgreSQL/PPAS 관련 듣고 싶은 교육은


총 게시물 162건, 최근 0 건
   

check-for-existence strategy

글쓴이 : 모델광 날짜 : 2022-11-12 (토) 12:28 조회 : 903
This is just a lightweight note on how to improve performance of a query which checks if at least one record is present.
Recently a novice developer asked me to speed up a query. The query had codes like this:

SELECT A.PRD_ID, CASE WHEN (SELECT COUNT(*)
                                        FROM PRODUCT_IMAGE B
                   WHERE DELETE_YN = 'N'
                     AND B.PRD_ID = A.PRD_ID) > 0
            THEN 'Y'
            ELSE 'N'
       END AS EXIST_IMAGE_YN
  FROM PRODUCT A
 WHERE CASE WHEN (SELECT COUNT(*)
                    FROM PRODUCT_IMAGE B
                   WHERE DELETE_YN = 'N'
                     AND B.PRD_ID = A.PRD_ID) > 0
            THEN 'Y'
            ELSE 'N'
       END = :var;

The following is the ERD of those two tables in the query.

The developer told me that we could allocate the value 'Y' or 'N' to the :var variable. When we assign 'Y' to :var, we can get the PRD_IDs which have image files. When we assign 'N' to :var, we can get the PRD_IDs which don't have image files.

As a proactive worker, the first question I asked was: how often is this query executed and how important is it to be as efficient as possible? I didn't want to waste a couple of hours of coding and testing to save just three seconds every 24 hours. Some context was needed before charging into solution mode.
The next question I asked was: what's wrong with this code? And if it retrieves the result set you want, you are writing your query right.
My questions sounded sarcastic. But I was serious. The developer said that he was writing the query to meet new business requirements and was not sure how often the query would be run in the production environment. He just wanted to write an optimal query in the early stage of development and he was not happy with the query because the same subquery showed up in the SELECT list and in the WHERE clause. The developer thought that the same subquery in the SELECT list and in the WHERE clause was the pain point that killed performance. And he was on the brink of tears because his boss blamed him for not writing a performant query. So I advised him to listen to music for a moment to relax. Here is the URL of a music video I recommended him to listen to:

In this note I will supply a query pattern that may be useful to check if a record is present.

Here is some code to create the test data set:

CREATE TABLE PRODUCT (
PRD_ID   INT NOT NULL,
PRD_NM   VARCHAR(100),
PRD_PRICE  NUMERIC,
PRD_DESC   VARCHAR(1000),
CONSTRAINT pk_PRODUCT PRIMARY KEY (PRD_ID)
);

CREATE TABLE PRODUCT_IMAGE (
PRD_ID   INT NOT NULL NOT NULL,
IMG_SEQ  INT NOT NULL NOT NULL,
IMG_FILE_NM  VARCHAR(100),
IMG_FILE_PATH VARCHAR(100),
DELETE_YN    VARCHAR(1),
CONSTRAINT PK_PRODUCT_IMAGE PRIMARY KEY (PRD_ID, IMG_SEQ)
);

INSERT INTO PRODUCT
SELECT i, 'ProductName'||i::text, mod(i,4000), repeat('z',100)
  FROM generate_series(1,2000000) a(i);

INSERT INTO PRODUCT_IMAGE
SELECT A.PRD_ID, b.i, 'ImageFileName'||A.PRD_ID::text||b.i::text, repeat('path',2), 'N'
  FROM PRODUCT A, generate_series(1,3) b(i)
 WHERE MOD(A.PRD_ID,4) <> 3;
 
SELECT pg_relation_size('PRODUCT'), pg_relation_size('PRODUCT_IMAGE');

analyze product;
analyze product_image;

I have disabled parallel processing for the purpose of simplicity and clarity of my explanation.

SET MAX_PARALLEL_WORKERS_PER_GATHER = 0; 

I have run the following query allocating 'Y' or 'N' to the :var variable.

SELECT A.PRD_ID, CASE WHEN (SELECT COUNT(*)
                    FROM PRODUCT_IMAGE B
                   WHERE DELETE_YN = 'N'
                     AND B.PRD_ID = A.PRD_ID) > 0
            THEN 'Y'
            ELSE 'N'
       END AS EXIST_IMAGE_YN
  FROM PRODUCT A
 WHERE CASE WHEN (SELECT COUNT(*)
                    FROM PRODUCT_IMAGE B
                   WHERE DELETE_YN = 'N'
                     AND B.PRD_ID = A.PRD_ID) > 0
            THEN 'Y'
            ELSE 'N'
       END = :var;

When I  assigned 'Y' to :var, I got the following execution plan:
     
Seq Scan on product a  (cost=0.00..32225464.07 rows=10001 width=36) (actual time=0.057..7409.173 rows=1500000 loops=1)
  Filter: (CASE WHEN ((SubPlan 2) > 0) THEN 'Y'::text ELSE 'N'::text END = 'Y'::text)
  Rows Removed by Filter: 500000
  Buffers: shared hit=19500352 read=39287
  SubPlan 1
    ->  Aggregate  (cost=15.99..16.00 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=1500000)
          Buffers: shared hit=9000000
          ->  Index Scan using pk_product_image on product_image b  (cost=0.43..15.98 rows=3 width=0) (actual time=0.001..0.002 rows=3 loops=1500000)
                Index Cond: (prd_id = a.prd_id)
                Filter: ((delete_yn)::text = 'N'::text)
                Buffers: shared hit=9000000
  SubPlan 2
    ->  Aggregate  (cost=15.99..16.00 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=2000000)
          Buffers: shared hit=10500000
          ->  Index Scan using pk_product_image on product_image b_1  (cost=0.43..15.98 rows=3 width=0) (actual time=0.001..0.001 rows=2 loops=2000000)
                Index Cond: (prd_id = a.prd_id)
                Filter: ((delete_yn)::text = 'N'::text)
                Buffers: shared hit=10500000
Planning Time: 0.092 ms
Execution Time: 7473.390 ms

Can you spot the inefficiency in the plan? SubPlans are a bit similar to NestedLoop. These can be called many times. For every row that is returned by "Seq Scan on product a", PostgreSQL has to run SubPlan2, which aggregates to compute the count over the intermediate result set of "Index Scan" node. Please pay your attention to "loops=20000000" in the Aggregate line, which tells us that SubPlan2 was called 2 million times. We can safely say that SubPlan1 was called 1500000 times by looking at "loops=1500000" in the Aggregate line below SubPlan1.

When I assigned 'N' to :var, I got the following execution plan:

Seq Scan on product a  (cost=0.00..32225464.07 rows=10001 width=36) (actual time=0.082..4601.768 rows=500000 loops=1)
  Filter: (CASE WHEN ((SubPlan 2) > 0) THEN 'Y'::text ELSE 'N'::text END = 'N'::text)
  Rows Removed by Filter: 1500000
  Buffers: shared hit=12000320 read=39319
  SubPlan 1
    ->  Aggregate  (cost=15.99..16.00 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=500000)
          Buffers: shared hit=1500000
          ->  Index Scan using pk_product_image on product_image b  (cost=0.43..15.98 rows=3 width=0) (actual time=0.001..0.001 rows=0 loops=500000)
                Index Cond: (prd_id = a.prd_id)
                Filter: ((delete_yn)::text = 'N'::text)
                Buffers: shared hit=1500000
  SubPlan 2
    ->  Aggregate  (cost=15.99..16.00 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=2000000)
          Buffers: shared hit=10500000
          ->  Index Scan using pk_product_image on product_image b_1  (cost=0.43..15.98 rows=3 width=0) (actual time=0.001..0.001 rows=2 loops=2000000)
                Index Cond: (prd_id = a.prd_id)
                Filter: ((delete_yn)::text = 'N'::text)
                Buffers: shared hit=10500000
Planning Time: 0.109 ms
Execution Time: 4623.509 ms

Note that the query accessed the pk_product_image index 2,000,000 times in the where clause and it also accessed the same index 500,000 times in the SELECT list. So huge amount of access (2,500,000 in total) to the index looks like a good culprit for the problem we are trying to solve.

When we take a closer look at the subquery, the subquery is counting the number of rows in the PRODUCT_IMAGE table in order to check if there are any rows present. Counting the number of rows is more expensive than checking the existence of a row in a table. Counting the number of rows means that PostgreSQL has to search all the rows that match the criteria. On the other hand, checking the existence of a row entails stopping processing further rows after finding the first instance of the row which matches the criteria. Basically the Exists clause is the best option to check the existence of a row. So, assuming we've decided that some form of check-for-existence strategy is desirable (surely you may find a better strategy than this), here is one of the smarter solutions.

SELECT A.PRD_ID
      , CASE WHEN :var='Y' THEN 'Y' 
             WHEN :var='N' THEN 'N'
        END AS EXIST_IMAGE_YN
  FROM PRODUCT A
 WHERE CASE WHEN (SELECT 1
                   WHERE EXISTS (SELECT 1
                                  FROM PRODUCT_IMAGE B
                                 WHERE DELETE_YN = 'N'
                                   AND B.PRD_ID = A.PRD_ID)
                 ) > 0
            THEN 'Y'
            ELSE 'N'
       END = :var;

The first point to note is that the above query is logically quivalent to the original query. Note that I eliminated the subquery in the SELECT list and I had the optimizer check for existence of any rows in the PRODUCT_IMAGE table instead of counting the number of rows in the PRODUCT_IMAGE table that satisfy some conditions.

Here's the execution plan obtained using the EXPLAIN (ANALYZE, BUFFERS) command.

Seq Scan on product a  (cost=0.00..11319985.99 rows=10001 width=36) (actual time=0.054..2909.378 rows=1500000 loops=1)
  Filter: (CASE WHEN ((SubPlan 2) > 0) THEN 'Y'::text ELSE 'N'::text END = 'Y'::text)
  Rows Removed by Filter: 500000
  Buffers: shared hit=7500256 read=39383
  SubPlan 2
    ->  Result  (cost=5.61..5.62 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=2000000)
          One-Time Filter: $1
          Buffers: shared hit=7500000
          InitPlan 1 (returns $1)
            ->  Index Scan using pk_product_image on product_image b  (cost=0.43..15.98 rows=3 width=0) (actual time=0.001..0.001 rows=1 loops=2000000)
                  Index Cond: (prd_id = a.prd_id)
                  Filter: ((delete_yn)::text = 'N'::text)
                  Buffers: shared hit=7500000
Planning Time: 0.122 ms
Execution Time: 2968.715 ms

The first thing to note is that the Aggregate node was removed. However, the execution plan is still nagging us with SubPlan2 "Result" node gets added when a query selects some constant values (SELECT 1 in the above subquery). Please note "loops=2000000" in the "Result" node, which indicates that SubPlan2 was called 2 million times. You might think that accessing the pk_product_image index 2,000,000 times is outrageous in terms of query performance. Before reading on this article, think for a minute about how we can improve performance.

Here is a more cunning option - I tried to induce the optimizer to use a hash semi join or a hash anti join.

SELECT A.PRD_ID, CASE WHEN :VAL ='Y' THEN 'Y' ELSE 'N' END AS EXIST_IMAGE_YN
  FROM PRODUCT A
 WHERE (EXISTS (SELECT 1
                 FROM PRODUCT_IMAGE B
                WHERE B.DELETE_YN = 'N'
                  AND A.PRD_ID = B.PRD_ID)
            AND 'Y' = :VAL
        )
    OR (NOT EXISTS (SELECT 1
                 FROM PRODUCT_IMAGE B
                WHERE B.DELETE_YN = 'N'
                  AND A.PRD_ID = B.PRD_ID)
            AND 'N' = :VAL
       )
;

When we allocate 'Y' to the :VAL variable, we get the following plan.

Seq Scan on product a  (cost=0.00..11289982.41 rows=1000120 width=36) (actual time=0.070..2295.409 rows=1500000 loops=1)
  Filter: (SubPlan 1)
  Rows Removed by Filter: 500000
  Buffers: shared hit=7500929 read=38710
  SubPlan 1
    ->  Index Scan using pk_product_image on product_image b  (cost=0.43..15.98 rows=3 width=0) (actual time=0.001..0.001 rows=1 loops=2000000)
          Index Cond: (prd_id = a.prd_id)
          Filter: ((delete_yn)::text = 'N'::text)
          Buffers: shared hit=7500000
Planning Time: 0.116 ms
Execution Time: 2350.457 ms

Note that compared with the original query the number of block IOs decreased from (19,500,352 + 39,287) to (7,500,929 + 38710 ) and the elapsed time dropped from 7473 to 2350 ms. However, we failed to prompt the optimizer to use a hash semi join. The thing is that Or conditions in database engines like Oracle or PostgreSQL tend to introduce a negative impact on performance. Sometimes they are not bad, but I have had queries that went from taking minutes to lightening fast because of Or conditions. One possible solution is to change the OR into a UNION or even a UNION ALL. So I rewrote the query as follows:

SELECT A.PRD_ID, 'Y' AS EXIST_IMAGE_YN
  FROM PRODUCT A
 WHERE EXISTS (SELECT 1
                 FROM PRODUCT_IMAGE B
                WHERE B.DELETE_YN = 'N'
                  AND A.PRD_ID = B.PRD_ID)
   AND 'Y' = :VAL
UNION ALL
SELECT A.PRD_ID, 'N' AS EXIST_IMAGE_YN
  FROM PRODUCT A
 WHERE NOT EXISTS (SELECT 1
                 FROM PRODUCT_IMAGE B
                WHERE B.DELETE_YN = 'N'
                  AND A.PRD_ID = B.PRD_ID)
   AND 'N' = :VAL;

When you allocate 'Y' to the :VAL variable, you get the following plan.

Hash Semi Join  (cost=169865.65..285050.94 rows=1380628 width=36) (actual time=939.736..1810.502 rows=1500000 loops=1)
  Hash Cond: (a.prd_id = b.prd_id)
  Buffers: shared hit=40516 read=38902, temp read=18482 written=18482
  ->  Seq Scan on product a  (cost=0.00..59641.39 rows=2000239 width=4) (actual time=0.012..195.315 rows=2000000 loops=1)
        Buffers: shared hit=737 read=38902
  ->  Hash  (cost=96032.33..96032.33 rows=4500266 width=4) (actual time=938.828..938.828 rows=4500000 loops=1)
        Buckets: 262144  Batches: 32  Memory Usage: 6974kB
        Buffers: shared hit=39779, temp written=12759
        ->  Seq Scan on product_image b  (cost=0.00..96032.33 rows=4500266 width=4) (actual time=0.022..482.027 rows=4500000 loops=1)
              Filter: ((delete_yn)::text = 'N'::text)
              Buffers: shared hit=39779
Planning:
  Buffers: shared hit=16
Planning Time: 0.423 ms
Execution Time: 1853.102 ms

When you allocate 'N' to the :VAL variable, you get the following plan.

Hash Anti Join  (cost=169865.65..277440.77 rows=619611 width=36) (actual time=961.812..1784.279 rows=500000 loops=1)
  Hash Cond: (a.prd_id = b.prd_id)
  Buffers: shared hit=40484 read=38934, temp read=18482 written=18482
  ->  Seq Scan on product a  (cost=0.00..59641.39 rows=2000239 width=4) (actual time=0.010..200.522 rows=2000000 loops=1)
        Buffers: shared hit=705 read=38934
  ->  Hash  (cost=96032.33..96032.33 rows=4500266 width=4) (actual time=960.595..960.595 rows=4500000 loops=1)
        Buckets: 262144  Batches: 32  Memory Usage: 6974kB
        Buffers: shared hit=39779, temp written=12759
        ->  Seq Scan on product_image b  (cost=0.00..96032.33 rows=4500266 width=4) (actual time=0.091..495.088 rows=4500000 loops=1)
              Filter: ((delete_yn)::text = 'N'::text)
              Buffers: shared hit=39779
Planning:
  Buffers: shared hit=16
Planning Time: 0.278 ms
Execution Time: 1799.066 ms

As we can see, we succeeded in getting rid of nagging SubPlans. You are kindly advised to see the changes of estimated costs in the plan as we tweak the query. With the revised query we can observe that the number of block IOs and the elapsed time droppped.
This check-for-existence strategy with a UNION ALL is efficient on three counts - first that PostgreSQL runs the upper branch or the lower branch depending on the :VAL value assigned; second that PostgreSQL uses a hash semi join or a hash anti join, which causes the optimizer to avoid repetitive access to the index; third that PostgreSQL takes away aggregate operations.


Conclusion
When you see a query which counts the number of rows, examine whether you can change the SQL statement into checking the existence of a row.

   

postgresdba.com