설문조사
PostgreSQL/PPAS 관련 듣고 싶은 교육은


총 게시물 169건, 최근 0 건
 

Transitive Predicates and its limitations

글쓴이 : 모델광 날짜 : 2024-08-21 (수) 21:21 조회 : 146
The following article promted me to write this note.

https://news.jtbc.co.kr/article/article.aspx?news_id=NB12210948

Aristotle's syllogism is a cornerstone of logical reasoning, forming the basis of deductive reasoning that has influenced many fields from philosophy to database management. When I heard Yun's speech, I was struck by how closely his rhetoric resembled Aristotle's method of reasoning.

To illustrate:
1. Major Premise: This is a general statement or universal truth.
  * The President of Korea works for the security of Korea.

2. Minor Premise: This is a specific statement that relates to the major premise.
  * I am the President of Korea.

3. Conclusion: This is the logical result that follows from the two premises.
  * Therefore, those who oppose me at every corner of this society are anti-state forces.

This syllogism exmplifies how specific conclusions can be derived logically from general principle. Interestingly, PostgreSQL also leverages a form of syllogistic reasoning to optimize query performance, a mechanism known as a transitive predicate.

What is a Transitive Predicate?

In PostgreSQL, transitive predicates are a powerful optimization technique where the database engine infers and applies additional filtering conditioins based on existing ones. The basic logic follows Aristotle's syllogism: If A=B and B=C, then A=C. By applying such logic, PostgreSQL can optimize queries by narrowing down the number of rows processed at  each stage, thus improving performance.

Example: Transitive Predicate in Action
Consider the following example, tested on PostgreSQL 14.5:

create table customer (
cust_no         int                not null,
cust_nm        character varying(100),
register_date  timestamp(0),
register_dt      varchar(8),
cust_status_cd varchar(1),
register_channel_cd varchar(1),
cust_age        int,
active_yn       boolean,
sigungu_cd     varchar(5),
sido_cd          varchar(2)
);
insert into customer
select i, chr(65+mod(i,26))||i::text||'CUST_NM'
     , '2024-06-08'::date - mod(i,10000)
     , to_char(('2024-06-08'::date - mod(i,10000)),'yyyymmdd') as register_dt
     , mod(i,5)+1 as cust_status_cd
     , mod(i,3)+1 as register_channel_cd
     , trunc(random() * 100) +1 as age
     , case when mod(i,22) = 0 then false else true end as active_yn
     , case when mod(i,1000) = 0 then '11007'
            when mod(i,1000) = 1 then '11006'
            when mod(i,1000) = 2 then '11005'
            when mod(i,1000) = 3 then '11004'
            when mod(i,1000) = 4 then '11003'
            when mod(i,1000) = 5 then '11002'
            else '11001' end                  as sigungu_cd
      , case when mod(i,3) = 0 then '01'
             when mod(i,3) = 1 then '02'
             when mod(i,3) = 2 then '03' end as sido_cd
  from generate_series(1,1000000) a(i);

ALTER TABLE customer ADD CONSTRAINT customer_pk  PRIMARY KEY (cust_no);
 
create table online_order (
ord_no int           not null,
cust_no int      not null,
ord_date timestamp(0) not null,
ord_dt   varchar(8)   not null,
ord_status_cd varchar(1) not null,
comment  varchar(100)
);
insert into online_order
select i, mod(i,1000000)+1 as cust_no
      ,'2024-06-08'::date - mod(i,1000) as ord_date
      ,to_char(('2024-06-08'::date - mod(i,1000)),'yyyymmdd') as ord_dt
      ,(mod(i,4) + 1) as ord_status_cd
      ,lpad('x',100,'x')
  from generate_series(1,2000000,2) a(i);

alter table online_order add constraint online_order_pk primary key (ord_no);

Here is a query we have to investigate followed by its execution plan:

 select a.cust_no, a.cust_nm, b.ord_date, ord_status_cd
  from customer a
  join online_order b
    on a.cust_no = b.cust_no
 where a.cust_no = 110;

Nested Loop  (cost=0.42..31739.46 rows=2 width=28) (actual time=0.870..141.338 rows=2 loops=1)
  Buffers: shared hit=16094 read=3141
  ->  Index Scan using customer_pk on customer a  (cost=0.42..8.44 rows=1 width=18) (actual time=0.063..0.067 rows=1 loops=1)
        Index Cond: (cust_no = 110)
        Buffers: shared hit=4
  ->  Seq Scan on online_order b  (cost=0.00..31731.00 rows=2 width=14) (actual time=0.794..141.254 rows=2 loops=1)
        Filter: (cust_no = 110)
        Rows Removed by Filter: 999998
        Buffers: shared hit=16090 read=3141
Planning Time: 0.349 ms
Execution Time: 141.549 ms


Note that the b.cust_no = 110 filter predicate is applied in the Seq Scan on online_order node even though there is no explicit filter condition on cust_no in the online_order table. This is an example of PostgreSQL's optimizer applying a transitive predicate to optimize a query.

Limitations of Transitive Predicates
However, PostgreSQL is not as clever as Yun. The transitive predicate optimization is only applied to equality(=) conditions. It does not apply to IN, BETWEEN, or comparison operators such as <, >.
Consider the following query and its execution plan:

select a.cust_no, a.cust_nm, b.ord_date, ord_status_cd
  from customer a
  join online_order b
    on a.cust_no = b.cust_no
 where a.cust_no in (110,120);
 
Hash Join  (cost=12.91..31868.92 rows=2 width=28) (actual time=0.042..130.034 rows=4 loops=1)
  Hash Cond: (b.cust_no = a.cust_no)
  Buffers: shared hit=16083 read=3155
  ->  Seq Scan on online_order b  (cost=0.00..29231.00 rows=1000000 width=14) (actual time=0.017..62.326 rows=1000000 loops=1)
        Buffers: shared hit=16076 read=3155
  ->  Hash  (cost=12.88..12.88 rows=2 width=18) (actual time=0.012..0.013 rows=2 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        Buffers: shared hit=7
        ->  Index Scan using customer_pk on customer a  (cost=0.42..12.88 rows=2 width=18) (actual time=0.007..0.009 rows=2 loops=1)
              Index Cond: (cust_no = ANY ('{110,120}'::integer[]))
              Buffers: shared hit=7
Planning:
  Buffers: shared hit=8
Planning Time: 0.135 ms
Execution Time: 130.073 ms


In this case, the optimizer does not apply the filter predicate cust_no in (110,220) to the online_order table before the join, resulting in performance drop. The 1000000 rows from the online_order table participated in the join.

Improving Query Performance
How will we be able to improve the performance of this query?
We have to rewrite the SQL to manually apply the filter to the online_order table.

select a.cust_no, a.cust_nm, b.ord_date, ord_status_cd
  from customer a
  join online_order b
    on a.cust_no = b.cust_no
 where b.cust_no in (110,120)
   and a.cust_no in (110,120);

Nested Loop  (cost=0.42..31744.01 rows=1 width=28) (actual time=0.305..69.818 rows=4 loops=1)
  Join Filter: (a.cust_no = b.cust_no)
  Rows Removed by Join Filter: 2
  Buffers: shared hit=16095 read=3143
  ->  Seq Scan on online_order b  (cost=0.00..31731.00 rows=4 width=14) (actual time=0.286..69.788 rows=4 loops=1)
        Filter: (cust_no = ANY ('{110,120}'::integer[]))
        Rows Removed by Filter: 999996
        Buffers: shared hit=16088 read=3143
  ->  Materialize  (cost=0.42..12.89 rows=2 width=18) (actual time=0.004..0.005 rows=2 loops=4)
        Buffers: shared hit=7
        ->  Index Scan using customer_pk on customer a  (cost=0.42..12.88 rows=2 width=18) (actual time=0.013..0.015 rows=2 loops=1)
              Index Cond: (cust_no = ANY ('{110,120}'::integer[]))
              Buffers: shared hit=7
Planning:
  Buffers: shared hit=8
Planning Time: 0.157 ms
Execution Time: 69.861 ms


With this rewrtie, only 4 rows from the online_order table participated in the join. As a result, the elapsed time decreased from 130 ms to 69 ms, even though the number of block I/O operations remained unchanged.

Conclusion
PostgreSQL is not as clever as Yun. PostgreSQL's transitive predicate feature is effective for equality(=) conditions but does not apply to other operators like IN, BETWEEN, or comparison operators (e.g., >, <). When writing a query in PostgreSQL, it is essential to be aware of these limitations to avoid potential performance issues.
By understanding and addressing the limitations of transitive predicates in PostgreSQL, you can optimize query performance and ensure efficient data retrieval in your applications.

Footnote
If you are using Oracle, don't worry, be happy. Oracle's transitive predicate handling is more advanced compared to PostgreSQL. In Oracle, the optimizer can apply transitive logic to a wider range of conditions, incluing IN, BETWEEN and inequalities, which can lead to better performance optimization without requiring manual query rewrites.

 

postgresdba.com