Aristotle's syllogism is a cornerstone of logical reasoning, forming the basis of deductive reasoning that has influenced many fields from philosophy to database management. When I heard Yun's speech, I was struck by how closely his rhetoric resembled Aristotle's method of reasoning.
To illustrate: 1. Major Premise: This is a general statement or universal truth. * The President of Korea works for the security of Korea.
2. Minor Premise: This is a specific statement that relates to the major premise. * I am the President of Korea.
3. Conclusion: This is the logical result that follows from the two premises. * Therefore, those who oppose me at every corner of this society are anti-state forces.
This syllogism exmplifies how specific conclusions can be derived logically from general principle. Interestingly, PostgreSQL also leverages a form of syllogistic reasoning to optimize query performance, a mechanism known as a transitive predicate.
What is a Transitive Predicate? In PostgreSQL, transitive predicates are a powerful optimization technique where the database engine infers and applies additional filtering conditioins based on existing ones. The basic logic follows Aristotle's syllogism: If A=B and B=C, then A=C. By applying such logic, PostgreSQL can optimize queries by narrowing down the number of rows processed at each stage, thus improving performance.
Example: Transitive Predicate in Action Consider the following example, tested on PostgreSQL 14.5:
create table customer ( cust_no int not null, cust_nm character varying(100), register_date timestamp(0), register_dt varchar(8), cust_status_cd varchar(1), register_channel_cd varchar(1), cust_age int, active_yn boolean, sigungu_cd varchar(5), sido_cd varchar(2) ); insert into customer select i, chr(65+mod(i,26))||i::text||'CUST_NM' , '2024-06-08'::date - mod(i,10000) , to_char(('2024-06-08'::date - mod(i,10000)),'yyyymmdd') as register_dt , mod(i,5)+1 as cust_status_cd , mod(i,3)+1 as register_channel_cd , trunc(random() * 100) +1 as age , case when mod(i,22) = 0 then false else true end as active_yn , case when mod(i,1000) = 0 then '11007' when mod(i,1000) = 1 then '11006' when mod(i,1000) = 2 then '11005' when mod(i,1000) = 3 then '11004' when mod(i,1000) = 4 then '11003' when mod(i,1000) = 5 then '11002' else '11001' end as sigungu_cd , case when mod(i,3) = 0 then '01' when mod(i,3) = 1 then '02' when mod(i,3) = 2 then '03' end as sido_cd
from generate_series(1,1000000) a(i);
ALTER TABLE customer ADD CONSTRAINT customer_pk PRIMARY KEY (cust_no);
create table online_order ( ord_no int not null, cust_no int not null, ord_date timestamp(0) not null, ord_dt varchar(8) not null, ord_status_cd varchar(1) not null, comment varchar(100) ); insert into online_order select i, mod(i,1000000)+1 as cust_no ,'2024-06-08'::date - mod(i,1000) as ord_date ,to_char(('2024-06-08'::date - mod(i,1000)),'yyyymmdd') as ord_dt ,(mod(i,4) + 1) as ord_status_cd ,lpad('x',100,'x')
from generate_series(1,2000000,2) a(i);
alter table online_order add constraint online_order_pk primary key (ord_no);
Here is a query we have to investigate followed by its execution plan:
select a.cust_no, a.cust_nm, b.ord_date, ord_status_cd from customer a join online_order b on a.cust_no = b.cust_no where a.cust_no = 110;
Nested Loop (cost=0.42..31739.46 rows=2 width=28) (actual time=0.870..141.338 rows=2 loops=1) Buffers: shared hit=16094 read=3141 -> Index Scan using customer_pk on customer a (cost=0.42..8.44 rows=1 width=18) (actual time=0.063..0.067 rows=1 loops=1) Index Cond: (cust_no = 110) Buffers: shared hit=4 -> Seq Scan on online_order b (cost=0.00..31731.00 rows=2 width=14) (actual time=0.794..141.254 rows=2 loops=1) Filter: (cust_no = 110) Rows Removed by Filter: 999998 Buffers: shared hit=16090 read=3141 Planning Time: 0.349 ms Execution Time: 141.549 ms
Note that the b.cust_no = 110 filter predicate is applied in the Seq Scan on online_order node even though there is no explicit filter condition on cust_no in the online_order table. This is an example of PostgreSQL's optimizer applying a transitive predicate to optimize a query.
Limitations of Transitive Predicates However, PostgreSQL is not as clever as Yun. The transitive predicate optimization is only applied to equality(=) conditions. It does not apply to IN, BETWEEN, or comparison operators such as <, >. Consider the following query and its execution plan:
select a.cust_no, a.cust_nm, b.ord_date, ord_status_cd from customer a join online_order b on a.cust_no = b.cust_no where a.cust_no in (110,120);
Hash Join (cost=12.91..31868.92 rows=2 width=28) (actual time=0.042..130.034 rows=4 loops=1) Hash Cond: (b.cust_no = a.cust_no) Buffers: shared hit=16083 read=3155 -> Seq Scan on online_order b (cost=0.00..29231.00 rows=1000000 width=14) (actual time=0.017..62.326 rows=1000000 loops=1) Buffers: shared hit=16076 read=3155 -> Hash (cost=12.88..12.88 rows=2 width=18) (actual time=0.012..0.013 rows=2 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB Buffers: shared hit=7 -> Index Scan using customer_pk on customer a (cost=0.42..12.88 rows=2 width=18) (actual time=0.007..0.009 rows=2 loops=1) Index Cond: (cust_no = ANY ('{110,120}'::integer[])) Buffers: shared hit=7 Planning: Buffers: shared hit=8 Planning Time: 0.135 ms Execution Time: 130.073 ms
In this case, the optimizer does not apply the filter predicate cust_no in (110,220) to the online_order table before the join, resulting in performance drop. The 1000000 rows from the online_order table participated in the join.
Improving Query Performance How will we be able to improve the performance of this query? We have to rewrite the SQL to manually apply the filter to the online_order table.
select a.cust_no, a.cust_nm, b.ord_date, ord_status_cd from customer a join online_order b on a.cust_no = b.cust_no where b.cust_no in (110,120) and a.cust_no in (110,120);
Nested Loop (cost=0.42..31744.01 rows=1 width=28) (actual time=0.305..69.818 rows=4 loops=1) Join Filter: (a.cust_no = b.cust_no) Rows Removed by Join Filter: 2 Buffers: shared hit=16095 read=3143 -> Seq Scan on online_order b (cost=0.00..31731.00 rows=4 width=14) (actual time=0.286..69.788 rows=4 loops=1) Filter: (cust_no = ANY ('{110,120}'::integer[])) Rows Removed by Filter: 999996 Buffers: shared hit=16088 read=3143 -> Materialize (cost=0.42..12.89 rows=2 width=18) (actual time=0.004..0.005 rows=2 loops=4) Buffers: shared hit=7 -> Index Scan using customer_pk on customer a (cost=0.42..12.88 rows=2 width=18) (actual time=0.013..0.015 rows=2 loops=1) Index Cond: (cust_no = ANY ('{110,120}'::integer[])) Buffers: shared hit=7 Planning: Buffers: shared hit=8 Planning Time: 0.157 ms Execution Time: 69.861 ms
With this rewrtie, only 4 rows from the online_order table participated in the join. As a result, the elapsed time decreased from 130 ms to 69 ms, even though the number of block I/O operations remained unchanged.
Conclusion PostgreSQL is not as clever as Yun. PostgreSQL's transitive predicate feature is effective for equality(=) conditions but does not apply to other operators like IN, BETWEEN, or comparison operators (e.g., >, <). When writing a query in PostgreSQL, it is essential to be aware of these limitations to avoid potential performance issues. By understanding and addressing the limitations of transitive predicates in PostgreSQL, you can optimize query performance and ensure efficient data retrieval in your applications.
Footnote If you are using Oracle, don't worry, be happy. Oracle's transitive predicate handling is more advanced compared to PostgreSQL. In Oracle, the optimizer can apply transitive logic to a wider range of conditions, incluing IN, BETWEEN and inequalities, which can lead to better performance optimization without requiring manual query rewrites.