Recently, while tuning an SQL statement, I was struck by an oddity in how PostgreSQL handles BETWEEN predicates. In this article I will describe what I observed and share some findings on how PostgreSQL evaluates these predicates.
It's probably common knowledge that if your SQL has lines like this:
col1 between {A} and {B}
the optimizer will transform them into lines like these:
col1 >= {A}
and col1 <= {B}
One of the questions that crossed my mind was this: "does the optimizer get clever about which constant to use first?" Many people - particularly those with a strong background in Oracle - may say yes without hesitation. I wanted to check whether PostgreSQL could be as clever as Oracle.
Here is a little test script that illustrates the problem with the BETWEEN predicate.
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (c1 timestamp) WITH (fillfactor = 90);
INSERT INTO t1
SELECT generate_series('2000-01-01', '2020-12-31', interval '1 minute');
SELECT min(c1), max(c1) FROM t1;
ANALYZE t1;
SELECT * FROM pg_stats WHERE tablename = 't1';
SET max_parallel_workers_per_gather = 0;
SELECT *
FROM t1
WHERE c1 BETWEEN to_timestamp('20000101','yyyymmdd') AND to_timestamp('20000102','yyyymmdd');
SELECT *
FROM t1
WHERE c1 BETWEEN to_timestamp('20201230','yyyymmdd') AND to_timestamp('20201231','yyyymmdd');
All I have done is create a table with a c1 column holding timestamps from '2000-01-01' to '2020-12-31' at one-minute intervals. Then I ran two simple queries that each retrieve one day of data based on the value of c1 - but at opposite ends of the range.
Now the question is: "does the execution plan change with the choice of range?"
The answer is no.
When you examine the plans closely, you will see that the order of the conditions in the Filter operation does not change. Here are the execution plans produced.
--execution plan of the first query
Seq Scan on t1 (cost=0.00..275035.34 rows=1 width=0) (actual time=0.013..18846.580 rows=1441 loops=1)
Filter: ((c1 >= to_timestamp('20000101'::text, 'yyyymmdd'::text)) AND (c1 <= to_timestamp('20000102'::text, 'yyyymmdd'::text)))
Rows Removed by Filter: 11043360
Buffers: shared hit=54142
Planning:
Buffers: shared hit=6
Planning Time: 1.710 ms
Execution Time: 18846.624 ms
--execution plan of the second query
Seq Scan on t1 (cost=0.00..275035.34 rows=589 width=0) (actual time=8851.985..8853.982 rows=1441 loops=1)
Filter: ((c1 >= to_timestamp('20201230'::text, 'yyyymmdd'::text)) AND (c1 <= to_timestamp('20201231'::text, 'yyyymmdd'::text)))
Rows Removed by Filter: 11043360
Buffers: shared hit=54142
Planning Time: 0.107 ms
Execution Time: 8854.030 ms
Note that the order of the filter predicates does not change as we move from one end of the range to the other. Compared to the second query, the first one is inefficient. In the first query, c1 <= to_timestamp('20000102','yyyymmdd') is the condition that is more likely to fail. If the optimizer had evaluated that condition first, it would have evaluated c1 >= to_timestamp('20000101','yyyymmdd') only 1,441 times. Since it evaluated c1 >= to_timestamp('20000101','yyyymmdd') first, it evaluated c1 <= to_timestamp('20000102','yyyymmdd') 11,043,360 times. If you run the two queries repeatedly, you will notice that the second one is consistently about 2x faster; the difference is pure CPU time spent on the extra comparisons. So if you want the first query to run faster, you have to rewrite it like this:
SELECT *
FROM t1
WHERE c1 <= to_timestamp('20000102','yyyymmdd')
  AND c1 >= to_timestamp('20000101','yyyymmdd');
Seq Scan on t1 (cost=0.00..275041.36 rows=1 width=8) (actual time=0.017..9000.382 rows=1441 loops=1)
Filter: ((c1 <= to_timestamp('20000102'::text, 'yyyymmdd'::text))
AND (c1 >= to_timestamp('20000101'::text, 'yyyymmdd'::text)))
Rows Removed by Filter: 11043360
Buffers: shared hit=54142
Planning Time: 0.126 ms
Execution Time: 9000.425 ms
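The arithmetic behind the gap can be sketched outside the database. The following Python snippet is my own scaled-down illustration, not PostgreSQL internals: rows are plain integer offsets, and the made-up bounds place the matching rows at the low end of the data, as in the first query.

```python
# A scaled-down sketch of why predicate order matters under
# short-circuit AND evaluation. Rows are integer offsets standing in
# for minute timestamps; N and the bounds are made up for the sketch
# (the real table has ~11 million rows).

N = 10_000          # total rows
LO, HI = 0, 100     # BETWEEN bounds, near the minimum of the data

def count_second_checks(first, second):
    """Evaluate `first` for every row; evaluate `second` only when
    `first` passes, mimicking short-circuit AND. Return how many
    times the second comparison ran."""
    checks = 0
    for row in range(N):
        if first(row):
            checks += 1
            second(row)
    return checks

# Order as PostgreSQL keeps it: the c1 >= LO test runs first.
ge_first = count_second_checks(lambda r: r >= LO, lambda r: r <= HI)
# Reversed order: the selective c1 <= HI test runs first.
le_first = count_second_checks(lambda r: r <= HI, lambda r: r >= LO)

print(ge_first)  # 10000 - the >= test passes for every row, so the
                 #         second comparison never short-circuits away
print(le_first)  # 101   - the <= test filters out almost everything
```

Scale N up to the roughly 11 million rows of the test table and this comparison-count gap becomes the CPU-time gap visible in the two execution plans above.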
If you run the same experiment in Oracle, the Oracle optimizer evaluates the C1 <= to_timestamp('20000102','yyyymmdd') comparison first when given the predicate C1 BETWEEN to_timestamp('20000101','yyyymmdd') AND to_timestamp('20000102','yyyymmdd').
This oddity in PostgreSQL won't make much difference to CPU usage in most cases, but in extreme cases like the example above you may see noticeably different performance.
Conclusion
If your SQL statement has lines like this:
col1 BETWEEN {A} AND {B}
PostgreSQL evaluates the comparison against constant A first, whether or not that is optimal. So if you know your data and you are confident that testing against constant B first is optimal, rewrite the query like this:
col1 <= {B} AND col1 >= {A}