Defects in handling the BETWEEN predicate

Author : 모델광   Date : 2022-09-03 (Sat) 07:06   Views : 935

Recently, while tuning an SQL statement, I was struck by an oddity in how PostgreSQL handles BETWEEN predicates. In this article, I will describe the oddity I observed and share some findings on how PostgreSQL handles BETWEEN predicates.


It's probably common knowledge that if your SQL has lines like this:


col1 BETWEEN {A} AND {B}


the optimizer will transform them into lines like these:


     col1 >= {A}

and col1 <= {B}
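
You can see this rewrite directly in a plan's Filter line. Below is a minimal sketch (the table name demo is just an illustration, not part of the test case that follows): the BETWEEN itself never appears in the plan, only the two derived comparisons.


CREATE TEMP TABLE demo (col1 int);
EXPLAIN SELECT * FROM demo WHERE col1 BETWEEN 1 AND 10;
--  Seq Scan on demo  (cost=... rows=... width=...)
--    Filter: ((col1 >= 1) AND (col1 <= 10))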


One of the questions that crossed my mind was this: "does the optimizer get clever about which constant to use first?"  Many people - particularly those with a strong background in Oracle - may say yes without hesitation. I wanted to check whether PostgreSQL could be as clever as Oracle.


Here is a little test script that will illustrate the problems with the "BETWEEN" predicate.


DROP TABLE t1;

create table t1(c1 timestamp) with (fillfactor=90);
insert into t1
select generate_series('2000-01-01','2020-12-31',interval '1 minutes');
select min(c1), max(c1) from t1;


ANALYZE T1;


select * from pg_stats where tablename='t1';
set max_parallel_workers_per_gather = 0;
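

For reference, the insert above generates one row per minute over 21 years, so t1 should end up with roughly 11 million rows (11,044,801: the 11,043,360 rows removed by the filters in the plans below plus the 1,441 rows returned). A quick sanity check:


SELECT count(*) FROM t1;
-- expected: about 11,044,801 rows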


SELECT *
  FROM T1
 WHERE C1 BETWEEN to_timestamp('20000101','yyyymmdd') AND to_timestamp('20000102','yyyymmdd');



SELECT *
  FROM T1
 WHERE C1 BETWEEN to_timestamp('20201230','yyyymmdd') AND to_timestamp('20201231','yyyymmdd');


All I have done is create a table with a C1 column holding timestamps from '2000-01-01' to '2020-12-31' at one-minute intervals. Then I have run two simple queries that each retrieve one day's worth of data based on the C1 column - but for different ranges of values.

Now the question is: "does the execution plan change with the choice of range?"
The answer is no.


When you examine the plans closely, you will notice that the Filter operation does not change. Here are the execution plans produced.
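
(The output shape, with actual times and buffer counts, suggests the plans were captured with something like the following; the exact EXPLAIN options are not shown in the original output.)


EXPLAIN (ANALYZE, BUFFERS)
SELECT *
  FROM T1
 WHERE C1 BETWEEN to_timestamp('20000101','yyyymmdd') AND to_timestamp('20000102','yyyymmdd');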


--execution plan of the first query


Seq Scan on t1  (cost=0.00..275035.34 rows=1 width=0) (actual time=0.013..18846.580 rows=1441 loops=1)
  Filter: ((c1 >= to_timestamp('20000101'::text, 'yyyymmdd'::text)) AND (c1 <= to_timestamp('20000102'::text, 'yyyymmdd'::text)))
  Rows Removed by Filter: 11043360
  Buffers: shared hit=54142
Planning:
  Buffers: shared hit=6
Planning Time: 1.710 ms
Execution Time: 18846.624 ms


--execution plan of the second query


Seq Scan on t1  (cost=0.00..275035.34 rows=589 width=0) (actual time=8851.985..8853.982 rows=1441 loops=1)
  Filter: ((c1 >= to_timestamp('20201230'::text, 'yyyymmdd'::text)) AND (c1 <= to_timestamp('20201231'::text, 'yyyymmdd'::text)))
  Rows Removed by Filter: 11043360
  Buffers: shared hit=54142
Planning Time: 0.107 ms
Execution Time: 8854.030 ms


Note that the order of the filter predicates has not changed as we move from one end of the range to the other. Compared to the second execution plan, the first one is not efficient. In the first query, c1 <= to_timestamp('20000102','yyyymmdd') is the condition that is more likely to fail. If the optimizer had evaluated that condition first, it would have needed to evaluate c1 >= to_timestamp('20000101','yyyymmdd') only 1,441 times. Because it evaluated c1 >= to_timestamp('20000101','yyyymmdd') first, a condition that holds for virtually every row in the table, it had to evaluate c1 <= to_timestamp('20000102','yyyymmdd') for all 11 million rows.

If you run those two queries many times, you will notice that the second query is consistently about twice as fast; the difference is the CPU time spent on the extra predicate evaluations. So if you want to make the first query faster, you have to rewrite it like this:


SELECT *
  FROM T1
 WHERE C1 <= to_timestamp('20000102','yyyymmdd')
     AND C1 >= to_timestamp('20000101','yyyymmdd');

 

Seq Scan on t1  (cost=0.00..275041.36 rows=1 width=8) (actual time=0.017..9000.382 rows=1441 loops=1)
  Filter: ((c1 <= to_timestamp('20000102'::text, 'yyyymmdd'::text)) AND (c1 >= to_timestamp('20000101'::text, 'yyyymmdd'::text)))
  Rows Removed by Filter: 11043360
  Buffers: shared hit=54142
Planning Time: 0.126 ms
Execution Time: 9000.425 ms


If you do the same experiment in Oracle, the Oracle optimizer evaluates the comparison against C1 <= to_timestamp('20000102','yyyymmdd') first when it processes the C1 BETWEEN to_timestamp('20000101','yyyymmdd') AND to_timestamp('20000102','yyyymmdd') predicate.

This oddity in PostgreSQL won't make much difference to CPU usage in most cases, but in some extreme cases like the example above you may get different performance.


Conclusion

If your SQL statement has lines like this:


col1 BETWEEN {A} AND {B}


PostgreSQL always evaluates the comparison against the constant {A} first, whether or not that is optimal. So if you know your data and you are confident that evaluating the comparison against {B} first is better, rewrite the query like this:

col1 <= {B} AND col1 >= {A}


   
