설문조사
PostgreSQL/PPAS 관련 듣고 싶은 교육은


총 게시물 131건, 최근 0 건
   

Distinct Placement

글쓴이 : 모델광 날짜 : 2023-02-09 (목) 21:19 조회 : 173
I am currently working on a project that involves transforming Oracle SQL statements into equivlanet PostgreSQL statements. During my work, I noticed a significant peformance degradation after running a query with the DISTINCT keyword in PostgreSQL.
Writing efficient and optimzed SQL queries requires a deep understanding of the database management system being used and the data being queried. This note is intended to provide you with some tips on how to enhance query performance with the DISTINCT keyword in PostgreSQL. I will start with a brief test code to generate some data which supports my point.


create table com_code (
group_cd  varchar(10),
cd        varchar(10),
cd_nm     varchar(100));
insert into com_code values ('G1','11001','SEOUL')
                           ,('G1','11002','PUSAN')
                           ,('G1','11003','INCHEON')
                           ,('G1','11004','DAEGU')
                           ,('G1','11005','JAEJU')
                           ,('G1','11006','ULEUNG')
                           ,('G1','11007','ETC');
insert into com_code values ('G2','1','Infant')
                           ,('G2','2','Child')
                           ,('G2','3','Adolescent')
                           ,('G2','4','Adult')
                           ,('G2','5','Senior');
insert into com_code values ('G3','01','Jeonbuk')
                           ,('G3','02','Kangwon')
                           ,('G3','03','Chungnam');

alter table com_code add constraint com_code_pk
  primary key (group_cd, cd);

create table customer (
cust_no        numeric not null,
cust_nm        character varying(100),
register_date  timestamp(0),
register_dt    varchar(8),
cust_status_cd varchar(1),
register_channel_cd varchar(1),
cust_age       numeric(3),
active_yn      varchar(1),
sigungu_cd     varchar(5),
sido_cd        varchar(2)
);

insert into customer
select i, chr(65+mod(i,26))||i::text||'CUST_NM'
     , current_date - mod(i,10000)
     , to_char((current_date - mod(i,10000)),'yyyymmdd') as register_dt
     , mod(i,5)+1 as cust_status_cd
     , mod(i,2)+1 as register_channel_cd
     , trunc(random() * 100) +1 as age
     , case when mod(i,22) = 0 then 'N' else 'Y' end as active_yn
     , case when mod(i,1000) = 0 then '11007'
            when mod(i,1000) = 1 then '11006'
            when mod(i,1000) = 2 then '11005'
            when mod(i,1000) = 3 then '11004'
            when mod(i,1000) = 4 then '11003'
            when mod(i,1000) = 5 then '11002'
            else '11001' end                  as sigungu_cd
      , case when mod(i,4) = 0 then '01'
             when mod(i,4) = 1 then '02'
             else                   '03' end as sido_cd
  from generate_series(1,1000000) a(i);
ALTER TABLE customer ADD CONSTRAINT customer_pk
  PRIMARY KEY (cust_no);

Here is the SQL statement in Oracle followed by its execution plan.
I have done this test on Oracle 12.2.


SELECT /+ NO_PX_JOIN_FILTER(B) */
          DISTINCT A.CD_NM, B.SIGUNGU_CD, B.REGISTER_CHANNEL_CD
  FROM COM_CODE A, CUSTOMER B
 WHERE A.CD = B.SIGUNGU_CD
     AND A.GROUP_CD = 'G1'
     AND B.CUST_NO BETWEEN 200000 AND 299999;
-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name            | Starts | A-Rows |   A-Time   | Buffers | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |                 |      1 |     21 |00:00:00.07 |    1055 |          |
|   1 |  HASH UNIQUE                            |                 |      1 |     21 |00:00:00.07 |    1055 | 1402K (0)|
|*  2 |   HASH JOIN                             |                 |      1 |     21 |00:00:00.07 |    1055 | 1268K (0)|
|   3 |    VIEW                                 | VW_DTP_377C5901 |      1 |      7 |00:00:00.01 |       2 |          |
|   4 |     HASH UNIQUE                         |                 |      1 |      7 |00:00:00.01 |       2 |  911K (0)|
|   5 |      TABLE ACCESS BY INDEX ROWID BATCHED| COM_CODE        |      1 |      7 |00:00:00.01 |       2 |          |
|*  6 |       INDEX RANGE SCAN                  | COM_CODE_PK     |      1 |      7 |00:00:00.01 |       1 |          |
|   7 |    VIEW                                 | VW_DTP_EE607F02 |      1 |     21 |00:00:00.07 |    1053 |          |
|   8 |     HASH UNIQUE                         |                 |      1 |     21 |00:00:00.07 |    1053 | 1481K (0)|
|   9 |      TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER        |      1 |    100K|00:00:00.05 |    1053 |          |
|* 10 |       INDEX RANGE SCAN                  | CUSTOMER_PK     |      1 |    100K|00:00:00.02 |     225 |          |
-----------------------------------------------------------------------------------------------------------------------
From the execution plan above, we can infer that Oracle tranformed the query into the following:

SELECT DISTINCT CD_NM, SIGUNGU_CD, REGISTER_CHANNEL_CD
  FROM (SELECT DISTINCT CD, CD_NM
           FROM COM_CODE
          WHERE GROUP_CD = 'G1') VW_DTP_377C5901
       , (SELECT DISTINCT SIGUNGU_CD, REGISTER_CHANNEL_CD
           FROM CUSTOMER
          WHERE CUST_NO BETWEEN 200000 AND 299999) VW_DTP_EE607F02
 WHERE VW_DTP_377C5901.CD = VW_DTP_EE607F02.SIGUNGU_CD;

What an beautiful optimizer it is! Oracle attempts to create an inline view for each table and then performs the DISTINCT operation on each set. This helps to reduce the number of rows that need to be joined.

However, PostgreSQL doesn't have such a mechanism. It runs the query without transforming it. Therefore, we have to create an optimal query for PostgreSQL.
Let's see how PostgreSQL handles the following query:


SELECT DISTINCT A.CD_NM, B.SIGUNGU_CD, B.REGISTER_CHANNEL_CD
  FROM COM_CODE A, CUSTOMER B
 WHERE A.CD = B.SIGUNGU_CD
     AND A.GROUP_CD = 'G1'
     AND B.CUST_NO BETWEEN 200000 AND 299999;

Here is the execution plan PostgreSQL generates:
I have done this test on PostgreSQL 15.1.


HashAggregate  (cost=4625.91..4626.12 rows=21 width=226) (actual time=48.669..48.673 rows=21 loops=1)
  Group Key: a.cd_nm, b.sigungu_cd, b.register_channel_cd
  Batches: 1  Memory Usage: 24kB
  Buffers: shared hit=1415
  ->  Hash Join  (cost=1.62..4575.94 rows=6663 width=226) (actual time=0.028..30.693 rows=100000 loops=1)
        Hash Cond: ((b.sigungu_cd)::text = (a.cd)::text)
        Buffers: shared hit=1415
        ->  Index Scan using customer_pk on customer b  (cost=0.42..4238.27 rows=99942 width=8) (actual time=0.015..12.407 rows=100000 loops=1)
              Index Cond: ((cust_no >= 200000) AND (cust_no <= 299999))
              Buffers: shared hit=1414
        ->  Hash  (cost=1.19..1.19 rows=1 width=256) (actual time=0.009..0.010 rows=7 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 9kB
              Buffers: shared hit=1
              ->  Seq Scan on com_code a  (cost=0.00..1.19 rows=1 width=256) (actual time=0.005..0.007 rows=7 loops=1)
                    Filter: ((group_cd)::text = 'G1'::text)
                    Rows Removed by Filter: 8
                    Buffers: shared hit=1
Planning Time: 0.114 ms
Execution Time: 48.701 ms

Note that 99942 rows from the CUSTOMER table participated in the join operation, resulting in 100,000 rows. And PostgreSQL performed the DISTINCT operation on the 100,000 rows resulting in 21 rows.
Let's rewrite the query and see the execution plan.


SELECT A.CD_NM, B.SIGUNGU_CD, B.REGISTER_CHANNEL_CD
   FROM COM_CODE A
        , (SELECT DISTINCT SIGUNGU_CD, REGISTER_CHANNEL_CD
             FROM CUSTOMER B
             WHERE CUST_NO BETWEEN 200000 AND 299999) B
 WHERE A.CD = B.SIGUNGU_CD
     AND A.GROUP_CD = 'G1';

The rewritten query is a little different from that transformed by the Oracle optimizer. In the above query the relationship between the A set and the B set is one-to-many. Therefore, we do not have to perform the DISTINCT operation on the A set. The Oracle optimizer is not always the most efficient. We need to write more optimal queries.

Here is the execution plan PostgreSQL produces:


Hash Join  (cost=4739.18..4739.66 rows=1 width=226) (actual time=31.542..31.550 rows=21 loops=1)
  Hash Cond: ((b.sigungu_cd)::text = (a.cd)::text)
  Buffers: shared hit=1415
  ->  HashAggregate  (cost=4737.98..4738.19 rows=21 width=8) (actual time=31.524..31.527 rows=21 loops=1)
        Group Key: b.sigungu_cd, b.register_channel_cd
        Batches: 1  Memory Usage: 24kB
        Buffers: shared hit=1414
        ->  Index Scan using customer_pk on customer b  (cost=0.42..4238.27 rows=99942 width=8) (actual time=0.013..13.583 rows=100000 loops=1)
              Index Cond: ((cust_no >= 200000) AND (cust_no <= 299999))
              Buffers: shared hit=1414
  ->  Hash  (cost=1.19..1.19 rows=1 width=256) (actual time=0.014..0.015 rows=7 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        Buffers: shared hit=1
        ->  Seq Scan on com_code a  (cost=0.00..1.19 rows=1 width=256) (actual time=0.009..0.011 rows=7 loops=1)
              Filter: ((group_cd)::text = 'G1'::text)
              Rows Removed by Filter: 8
              Buffers: shared hit=1
Planning Time: 0.128 ms
Execution Time: 31.580 ms


By rewriting the query we were able to decrease the number of rows participating in the join. The 7 rows from the COM_CODE table were joined to 21 rows from the CUSTOMER table, resuling in lower CPU usage. If you perform the test repeatedly, you can see that the rewritten query runs always faster.

Conclusion
Keep in mind that the optimal placement of the DISTINCT keyword can impact the performance of a query. When placed in the main part of the query the DISTINCT operation is performed after the join, but if the DISTINCT operation is performed on each set before joining, it can lower the CPU load as the number of rows participating in the join decreases. In this sense, it is critical to be mindful when converting an Oracle SQL statement into a PostgreSQL equivalent, as the execution plans may be different.


   

postgresdba.com