설문조사
PostgreSQL/PPAS 관련 듣고 싶은 교육은


총 게시물 124건, 최근 0 건
   

Speeding Up Subquery with OR 3

글쓴이 : 모델광 날짜 : 2022-10-03 (월) 21:33 조회 : 135

This is the third example to demonstrate the tuning technique:

"When there are complex OR predicates, rewrite the SQL as UNION ALL queries."

We seperate the two parts of the OR predicates to help the optimzer build an efficient plan with ease.


As an experienced Oracle SQL tuner, it is a lot of fun to find out the difference between two different DBMS optimizers. In this note I will offer a case where PostgreSQL is superior to Oracle in producing an efficient execution plan. This post assumes some basic familiarity with reading the excution plans of Oracle and PostgreSQL.


I supply some scripts below to explain why PostgreSQL is superior to Oracle in a specific case(Filter operation). Before we can get started we need to create some sample data:


CREATE TABLE T1

AS

SELECT i   id,

      mod(i,371) n1,

      lpad(i::varchar,10,'0') v1,

      lpad('x',100,'x') dummy

  FROM generate_series(1,1000000) a(i);


ALTER TABLE T1 ADD CONSTRAINT T1_PK PRIMARY KEY(ID);


CREATE TABLE T2 AS SELECT * FROM T1;

CREATE TABLE T3 AS SELECT * FROM T1, generate_series(1,10) a(i);


SELECT  PG_RELATION_SIZE('t1'), PG_RELATION_SIZE('t2'), PG_RELATION_SIZE('t3')

 pg_relation_size | pg_relation_size | pg_relation_size

------------------+------------------+------------------

        148946944 |        148946944 |       1575387136


I have created and populated three tables. The table T1 has a primary key declared, with a couple of extra columns.


Note that the size of t1 is 148 MBytes and the size of t3 is 1575 Mbytes.


Below is the SQL statement we are interested in:


SELECT T1.V1 

  FROM T1

 WHERE (

            $1 = 'SELLER' 

        AND T1.ID IN (SELECT ID FROM T2 WHERE N1 = 0)

        )

     OR      (

            $1 = 'BUYER' 

        AND T1.ID IN (SELECT ID FROM T3 WHERE N1 = 0)

        )

;


If you are familiar with ORACLE, you may think that without re-enginnering the SQL you cannot retrieve the results fast. The following is the execution plan Oracle produces.


---------------------------------------------------------------------------

| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |

---------------------------------------------------------------------------

|   0 | SELECT STATEMENT   |      |    10 |   150 |    26   (4)| 00:00:01 |

|*  1 |  FILTER            |      |       |       |            |          |

|   2 |   TABLE ACCESS FULL| T1   | 10000 |   146K|    26   (4)| 00:00:01 |

|*  3 |   TABLE ACCESS FULL| T2   |     1 |     8 |    26   (4)| 00:00:01 |

|*  4 |   TABLE ACCESS FULL| T3   |     1 |     8 |    26   (4)| 00:00:01 |

---------------------------------------------------------------------------

For every row in the T1 table Oracle has to execute the subquery to see if the row with the matching T1.ID in T2 or T3 has an '0' as the N1 column value. Oracle has to search T2 or T3 1,000,000 times in this case. This does not look efficient. If the cardinality of T1 is small, the above plan is OK. If it is not, the performance of the SQL is catastrophic.


So when I first saw the SQL statement in a PostgreSQL production system, I thought that the performance would be terribly bad and the query would be consuming all CPU load. But when I executed the query in PostgreSQL and observed its execution plan, I was somewhat shocked.


Below is the execution plan when I assigned 'SELLER' to $1.


Seq Scan on t1 (actual time=134.297..293.422 rows=2695 loops=1)
  Filter: (hashed SubPlan 1)
  Rows Removed by Filter: 997305
  Buffers: shared hit=6968 read=29396
  SubPlan 1
    ->  Gather (actual time=0.381..133.138 rows=2695 loops=1)
          Workers Planned: 2
          Workers Launched: 2
          Buffers: shared hit=2208 read=15974
          ->  Parallel Seq Scan on t2 (actual time=0.050..42.941 rows=898 loops=3)
                Filter: (n1 = 0)
                Rows Removed by Filter: 332435
                Buffers: shared hit=2208 read=15974
Planning Time: 0.142 ms
Execution Time: 293.620 ms


Though Filter operation kicked in, how the SQL gets its results is totally different from that of ORACLE. In the execution plan above, we have to pay attention to (hashed Subplan 1)PostgreSQL builds a hash table over the id's from T2 and searches the id's from T1 in the hash table. It is not visiting T2 1,000,000 times - which is what ORACLE does.  In fact, the mechanism of the plan above is similar to Hash Join.


The drawback of the SQL statement in investigation is that the query planner cannot consider other join methods and join orders. We need to change the SQL to something the planner can handle efficiently. Usually when there are subqueries it is a good idea to have the plan collapse a subquery.


The key issue is finding a way of working around the OR clauses that make it impossible for either of the two subqueries to be collapsed into a small driving data set. When you need to optimize queries of this shape you need to rewrite them as UNION ALL queries to separate the two parts of the OR predicate.

As an alternative I manually rewrote the subqueries as a single union all subquery and

I moved the bind variable comparisons inside their respective subquries.


SELECT T1.V1

  FROM T1

 WHERE T1.ID IN (SELECT ID FROM T2 WHERE N1=0 AND $1 = 'SELLER'

                     UNION ALL

                    SELECT ID FROM T3 WHERE N1=0 AND $1 = 'BUYER');


Nested Loop (actual time=144.907..153.328 rows=2695 loops=1)
  Buffers: shared hit=13020 read=15942
  ->  HashAggregate (actual time=144.883..145.214 rows=2695 loops=1)
        Group Key: t2.id
        Batches: 1  Memory Usage: 241kB
        Buffers: shared hit=2240 read=15942
        ->  Gather (actual time=0.364..143.623 rows=2695 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              Buffers: shared hit=2240 read=15942
              ->  Parallel Seq Scan on t2 (actual time=0.049..48.126 rows=898 loops=3)
                    Filter: (n1 = 0)
                    Rows Removed by Filter: 332435
                    Buffers: shared hit=2240 read=15942
  ->  Index Scan using t1_pk on t1 (actual time=0.003..0.003 rows=1 loops=2695)
        Index Cond: (id = t2.id)
        Buffers: shared hit=10780
Planning:
  Buffers: shared hit=8
Planning Time: 0.229 ms
Execution Time: 153.523 ms


We can observe that the subquery collapsed and the planner decided to use a Nested Loop join.


Wrap Up

Filter operation in PostgreSQL can be very efficient with Hashed Subplan.  PostgreSQL doesn't like the OR clause, which constrains the planner's choice of the join order, the join method, and the like. So we have to find ways to give it a little help if we see it running subqueries as filter subqueries when we are expecting it to collapse a subquery to produce a small driving data set.


   

postgresdba.com