When writing SQL queries in PostgreSQL, performance optimization is crucial for handling large datasets efficiently. One common mistake that can impact performance is using the OR operator in the where clause instead of IN.
This article explains why IN is generally more efficient than OR in PostgreSQL and provides execution plan comparisons to demonstrate the difference.
How Plans Are Generated
In PostgreSQL, the optimizer treats IN and OR differently when generating execution plans.
- IN Cluase Optimization:
PostgreSQL translates IN (A, B, C) into an array lookup (ANY operator), which allows it to efficiently walk through an index using a single index scan.
- OR Condition Complexity:
When using multiple OR conditions (e.g., c1=A or c2=B), PostgreSQL often uses a BitmapOr operation, which can introduce additional overhead, especially scanning large datasets.
Furthermore, if Bitmap Scan optimizations are disabled or not applicable, PostgreSQL may fall back to a sequential scan, leading to severe performance degradation.
The following example is tested on PostgreSQL 16.3.
drop table t1, t2;
create table t1 (c1 int primary key, c2 int, c3 char(500));
create table t2 (c1 int primary key, c2 int, c3 char(500));
insert into t1 select i, i, md5(i::text) from generate_series(1, 10000) a(i);
insert into t2 select i, i, md5(i::text) from generate_series(1, 1000000) a(i);
analyze verbose t1, t2;
set max_parallel_workers_per_gather to 0;
This setup creates two tables with primary keys on c1 and populates
them with 10,000 and 1,000,000 rows, respectively.
Below is a query using the IN operator followd by its execution plan.
explain (analyze, buffers)
select *
from t1
where c1 in (100,200);
Index Scan using t1_pkey on t1 (cost=0.29..12.61 rows=2 width=512) (actual time=0.019..0.022 rows=2 loops=1)
Index Cond: (c1 = ANY ('{100,200}'::integer[]))
Buffers: shared hit=6
Planning Time: 0.062 ms
Execution Time: 0.039 ms
We observe that PostgreSQL performs an Index Scan using the primary key (t1_key).
The optimizer efficiently converts IN (100,200) into an array lookup.
The query hits only 6 buffers.
Now let's take a look at how PostgreSQL handles OR.
explain (analyze, buffers)
select *
from t1 a
where c1 = 100
or c1 = 200;
Bitmap Heap Scan on t1 (cost=8.59..16.29 rows=2 width=512) (actual time=0.026..0.029 rows=2 loops=1)
Recheck Cond: ((c1 = 100) OR (c1 = 200))
Heap Blocks: exact=2
Buffers: shared hit=6
-> BitmapOr (cost=8.59..8.59 rows=2 width=0) (actual time=0.019..0.020 rows=0 loops=1)
Buffers: shared hit=4
-> Bitmap Index Scan on t1_pkey (cost=0.00..4.29 rows=1 width=0) (actual time=0.016..0.016 rows=1 loops=1)
Index Cond: (c1 = 100)
Buffers: shared hit=2
-> Bitmap Index Scan on t1_pkey (cost=0.00..4.29 rows=1 width=0) (actual time=0.003..0.003 rows=1 loops=1)
Index Cond: (c1 = 200)
Buffers: shared hit=2
Planning Time: 0.073 ms
Execution Time: 0.063 ms
PostgreSQL chooses a Bitmap Heap Scan instead of an Index Scan.
It uses two seperate Bitmap Index Scans, then combines them using BitmapOr, adding overhead.
And the execution time is slower (0.063 ms vs. 0.039 ms) compared to IN.
While the performance difference is small for this dataset, it can be significant for large tables.
What will happen when Bitmap Heap Scan is disabled?
I employed the pg_hint extension to disable Bitmap Heap Scan.
/+ NoBitmapScan(a) */
explain (analyze, buffers)
select *
from t1 a
where c1 = 100
or c1 = 200;
Seq Scan on t1 a (cost=0.00..817.00 rows=2 width=512) (actual time=3.081..197.940 rows=2 loops=1)
Filter: ((c1 = 100) OR (c1 = 200))
Rows Removed by Filter: 9998
Buffers: shared hit=667
Planning Time: 0.133 ms
Execution Time: 197.975 ms
A Seq Scan is forced when Bitmap Scan is disabled. The query scans all 10,000 rows, removing unnecessary ones, leading to high execution time(197.975 ms). It resulted in 667 buffer hits vs. just 6 buffer hits in the IN clause query.
This confirms that using IN is more efficient than OR, as it avoids full table scans in unfavorable conditions.
When Should We Still Use OR?
There are some cases where OR is necessary.
1. Different columns in the WHERE clause.
SELECT * FROM t1 WHERE c1 = 100 or c2 = 200;
2. Complex expressions that IN cannot replace.
SELECT * FROM t1 WHERE (c1 = 100 and c2 > 50) OR (C1=200 AND C2 < 30);
However, if all conditions involve the same column, IN is almost always the better choice.
Conclusion
1. Use IN instead of OR in the WHERE clause when filtering the same column.
2. IN queries use Index Scans efficiently (if an adequate index exists), while OR may trigger Bitmap Heap Scan or even Seq Scan.
3. The performance gap increases with larger tables and high concurrency environments.
4. By using IN instead of OR, you can significantly improve PostgreSQL query performance and reduce resource consumption.