PostgreSQL carefully optimizes complex multi-join queries to avoid expensive block and disk I/O. The PostgreSQL documentation says that in a hash join the smaller table should be the build table and the larger table the probe table in order to keep response time short. We can easily confirm this with a few simple experiments.
By the way, how does the planner decide which of the two joined tables is bigger? It uses the statistics it gathered by analyzing the tables. When the query contains complex predicates or joins many tables, however, it is difficult for the planner to estimate cardinality correctly, which can result in a bad execution plan. In that case application developers are left in the dark about how to reduce the elapsed time. One serious "defect" of PostgreSQL is that it provides no "hints", i.e. directives that tell the planner how to implement the query. So when the planner makes an obviously wrong decision, there is little we can do natively beyond tweaking the troubled SQL statement, and in most cases that will not achieve our goal (less block I/O). The practical way of working around this defect is the pg_hint_plan extension.
In this note I will demonstrate the statement "the smaller table should be the build table to keep response time short." I will also show how to control execution plans with the hint phrases of the pg_hint_plan extension.
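Before the experiments, it may help to see why the build side matters. The following is a minimal sketch of the classic hash-join algorithm (not PostgreSQL's actual implementation; all names here are my own): one pass over the smaller input builds an in-memory hash table, then the larger input is streamed past it, probing once per row. A small build side means a small table to hold in memory.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Toy hash join: hash the (smaller) build side, stream the probe side."""
    # Build phase: one pass over the smaller input.
    table = defaultdict(list)
    for row in build_rows:
        table[build_key(row)].append(row)
    # Probe phase: one pass over the larger input, one lookup per row.
    out = []
    for row in probe_rows:
        for match in table.get(probe_key(row), []):
            out.append((match, row))
    return out

small = [(1, 'a'), (2, 'b')]                 # build side (smaller)
big   = [(i, 'x') for i in range(1, 1001)]   # probe side (larger)
joined = hash_join(small, big, lambda r: r[0], lambda r: r[0])
print(len(joined))  # 2 -- only ids 1 and 2 match
```

Swapping the roles would still produce the same rows, but the hash table would then have to hold the big input, which is exactly what we want to avoid.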
The following is a script to create test tables.
drop table if exists t1;
drop table if exists t2;
drop table if exists t3;
drop table if exists t4;
create table t1
as
select i as id, rpad(chr(64+mod(i,26))::text,20,'x') as col2 from generate_series(1,10000000) a(i);
create table t2 as select * from t1 order by id limit 1000000;
create table t3 as select * from t1 order by id limit 10000;
create table t4 as select * from t1 order by id limit 1000;
analyze t1;
analyze t2;
analyze t3;
analyze t4;
select relname, reltuples, relpages, 8*relpages as size_kb
from pg_class
where relname in ('t1','t2','t3','t4');
relname | reltuples | relpages | size_kb
---------+-----------+----------+---------
t1 | 1e+07 | 73530 | 574453
t2 | 1e+06 | 7353 | 57445
t3 | 10000 | 74 | 578
t4 | 1000 | 8 | 62
The following shows a four-table join query run with different hints. It should give you a feel for how to use the "hints" of the pg_hint_plan extension. If you pay attention to the table sizes, you will also notice that the small sets should be processed first to keep execution time short.
/*+ Leading((a (b (c d)))) HashJoin(c d) HashJoin(b c d) HashJoin(a b c d) */
explain (analyze, buffers, costs off)
select count(a.col2), count(b.col2), count(c.col2), count(d.col2)
from t1 a, t2 b, t3 c, t4 d
where a.id = b.id
and b.id = c.id
and c.id = d.id;
Aggregate (actual time=747.931..749.591 rows=1 loops=1)
Buffers: shared hit=6587 read=74739
-> Gather (actual time=684.554..749.483 rows=1000 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=6587 read=74739
-> Parallel Hash Join (actual time=681.129..744.130 rows=333 loops=3) --3rd result set
Hash Cond: (a.id = b.id)
Buffers: shared hit=6586 read=74739
-> Parallel Seq Scan on t1 a (actual time=0.008..292.410 rows=3333333 loops=3)
Buffers: shared hit=3264 read=70266
-> Parallel Hash (actual time=65.596..65.600 rows=333 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 168kB
Buffers: shared hit=3126 read=4473
-> Hash Join (actual time=23.401..65.419 rows=333 loops=3) --2nd result set
Hash Cond: (b.id = c.id)
Buffers: shared hit=3126 read=4473
-> Parallel Seq Scan on t2 b (actual time=0.038..28.868 rows=333333 loops=3)
Buffers: shared hit=2880 read=4473
-> Hash (actual time=3.715..3.717 rows=1000 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 94kB
Buffers: shared hit=246
-> Hash Join (actual time=0.454..3.418 rows=1000 loops=3) --1st result set
Hash Cond: (c.id = d.id)
Buffers: shared hit=246
-> Seq Scan on t3 c (actual time=0.010..1.194 rows=10000 loops=3)
Buffers: shared hit=222
-> Hash (actual time=0.423..0.424 rows=1000 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 67kB
Buffers: shared hit=24
-> Seq Scan on t4 d (actual time=0.021..0.198 rows=1000 loops=3)
Buffers: shared hit=24
Planning Time: 0.516 ms
Execution Time: 749.648 ms
The above plan is called the "right-deep query plan".
The join order is: t4 -> t3 -> t2 -> t1
Although I hinted the SQL statement for demonstration purposes, this is exactly the plan the planner produces without any hint. Compared with the left-deep tree, the right-deep tree uses less hash-table memory: as soon as one result set is gathered, it is used to build the hash table for the next join.
The following is a quick description of the plan.
* Internally the executor first checks whether t1 contains any rows at all.
You can see "actual time=0.008" beside "-> Parallel Seq Scan on t1 a".
If t1 is empty, the executor stops there and does no further work.
* If it finds a row in t1, it checks for the existence of any rows in t2.
* If it finds a row in t2, it checks for the existence of any rows in t3.
* If it finds a row in t3, it proceeds as follows.
* We build a hash table from t4 and probe it with t3 to create the first result set.
* As this result set is generated, we build a new hash table from it.
* As the result set completes, we discard the hash table built from t4.
* We probe the new hash table with t2 to create the second result set (parallelism kicks in when reading t2).
* As the second result set is generated, we build a new hash table from it.
* As the second result set completes, we discard the hash table built from the first result set.
* We probe the newest hash table with t1 to create the third result set.
* As the third result set is generated, we pass its rows up to the Aggregate step to count the output.
(The Gather node collects the results of the two worker processes before passing them on.)
Notice that once the first join starts, the number of in-memory hash tables alive at any moment in this plan is two. And no matter how many tables are involved in this pattern, that number stays at two. The actual size of the two hash tables is a little unpredictable, though; as a very crude guideline, you can expect them to grow as more tables are joined into the result set.
In the execution plan above, the sizes of the hash tables were 67 kB, 94 kB and 168 kB respectively.
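The right-deep chaining described above can be sketched in a few lines. This is a toy model with assumed names (not PostgreSQL internals): each intermediate result set becomes the build side of the next join, and rebinding the hash-table variable discards the previous table, so at most two hash tables overlap at any moment.

```python
def build(rows, key):
    """Build a hash table (dict of key -> rows) from an input."""
    table = {}
    for row in rows:
        table.setdefault(key(row), []).append(row)
    return table

def probe(table, rows, key):
    """Keep the probe rows whose key appears in the hash table."""
    return [r for r in rows if key(r) in table]

# Stand-ins for the test tables, keyed by id only.
t1 = list(range(1, 101))
t2 = list(range(1, 51))
t3 = list(range(1, 21))
t4 = list(range(1, 11))

ht  = build(t4, lambda r: r)        # hash table from t4
rs1 = probe(ht, t3, lambda r: r)    # 1st result set
ht  = build(rs1, lambda r: r)       # new table replaces the t4 table
rs2 = probe(ht, t2, lambda r: r)    # 2nd result set
ht  = build(rs2, lambda r: r)       # again only old + new coexist briefly
rs3 = probe(ht, t1, lambda r: r)    # 3rd result set, fed to the Aggregate
print(len(rs3))  # 10
```

However many tables joined this way, the chain only ever needs the hash table being probed plus the one being built.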
/*+ Leading((((a b) c) d)) HashJoin(a b) HashJoin(a b c) HashJoin(a b c d) */
explain (analyze, buffers, costs off)
select count(a.col2), count(b.col2), count(c.col2), count(d.col2)
from t1 a, t2 b, t3 c, t4 d
where a.id = b.id
and b.id = c.id
and c.id = d.id;
Aggregate (actual time=2407.538..2474.817 rows=1 loops=1)
Buffers: shared hit=3035 read=78291, temp read=60797 written=61132
-> Gather (actual time=1575.259..2474.578 rows=1000 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=3035 read=78291, temp read=60797 written=61132
-> Hash Join (actual time=1567.938..2388.345 rows=333 loops=3)
Hash Cond: (a.id = d.id)
Buffers: shared hit=3034 read=78291, temp read=60797 written=61132
-> Hash Join (actual time=1567.176..2386.749 rows=3333 loops=3)
Hash Cond: (a.id = c.id)
Buffers: shared hit=2814 read=78291, temp read=60797 written=61132
-> Parallel Hash Join (actual time=1561.972..2322.007 rows=333333 loops=3)
Hash Cond: (a.id = b.id)
Buffers: shared hit=2592 read=78291, temp read=60797 written=61132
-> Parallel Seq Scan on t1 a (actual time=0.412..812.159 rows=3333333 loops=3)
Buffers: shared hit=2304 read=71226
-> Parallel Hash (actual time=156.858..156.859 rows=333333 loops=3) -- hash table 3
Buckets: 65536 Batches: 32 Memory Usage: 2528kB
Buffers: shared hit=288 read=7065, temp written=5908
-> Parallel Seq Scan on t2 b (actual time=0.210..76.429 rows=333333 loops=3)
Buffers: shared hit=288 read=7065
-> Hash (actual time=5.127..5.127 rows=10000 loops=3) -- hash table 2
Buckets: 16384 Batches: 1 Memory Usage: 714kB
Buffers: shared hit=222
-> Seq Scan on t3 c (actual time=0.020..2.044 rows=10000 loops=3)
Buffers: shared hit=222
-> Hash (actual time=0.507..0.507 rows=1000 loops=3) -- hash table 1
Buckets: 1024 Batches: 1 Memory Usage: 67kB
Buffers: shared hit=24
-> Seq Scan on t4 d (actual time=0.030..0.248 rows=1000 loops=3)
Buffers: shared hit=24
Planning Time: 0.611 ms
Execution Time: 2474.967 ms
The above plan is called the "left-deep query plan".
The join order is: t4 -> t3 -> t2 -> t1
* I will not explain parallelism here.
* We build a hash table from t4
* We build a hash table from t3
* We build a hash table from t2
* We pick a row from t1 and probe the t2 hash table
* If the row survives we probe the t3 hash table
* If the row survives we probe the t4 hash table
* If the row survives we pass it up to the Aggregate step to be counted.
In this case the number of simultaneous in-memory hash tables we end up with is three.
We can predict the approximate size of each hash table because each one is built directly from a real table whose statistics we know. If you have enough memory to hold all the hash tables at once (i.e. none of them spill to disk), you will find that this join pattern is likely to be faster. Unfortunately, in this test t2 is about 57 MB, and its hash table spilled to disk. When I increased work_mem and re-ran the query, the elapsed time dropped to 1702 ms with no spilling.
The key drawback of this execution plan is that, because the largest tables are joined first, the rows of the largest table "flow" through all the remaining joins.
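The left-deep pattern above can be sketched the same way (toy model, assumed names): all hash tables are built up front and stay resident, and every row of the big driving table is pushed through every probe in turn.

```python
def build(rows):
    """Toy build phase: keys only are enough for this sketch."""
    return set(rows)

# Three hash tables, all alive simultaneously (stand-ins for t2, t3, t4).
h2 = build(range(1, 51))
h3 = build(range(1, 21))
h4 = build(range(1, 11))

survivors = 0
for row in range(1, 101):      # the big table (t1) drives the pipeline
    if row in h2 and row in h3 and row in h4:
        survivors += 1         # row survived all three probes
print(survivors)  # 10
```

This shows both properties at once: three tables held in memory simultaneously, and one probe per hash table for every one of the driving table's rows, even the ones that are eliminated early.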
For completeness I also list all the different patterns of hints with their execution plans, stripped to the minimum output:
/*+ Leading(((a (b c)) d)) HashJoin(b c) HashJoin(a b c) HashJoin(a b c d) */
explain (analyze, buffers, costs off)
select count(a.col2), count(b.col2), count(c.col2), count(d.col2)
from t1 a, t2 b, t3 c, t4 d
where a.id = b.id
and b.id = c.id
and c.id = d.id;
Aggregate (actual time=880.112..902.555 rows=1 loops=1)
Buffers: shared hit=2715 read=78611
-> Gather (actual time=82.815..902.469 rows=1000 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=2715 read=78611
-> Hash Join (actual time=76.718..866.870 rows=333 loops=3)
Hash Cond: (a.id = d.id)
Buffers: shared hit=2714 read=78611
-> Parallel Hash Join (actual time=76.391..866.258 rows=3333 loops=3)
Hash Cond: (a.id = b.id)
Buffers: shared hit=2494 read=78611
-> Parallel Seq Scan on t1 a (actual time=0.254..485.078 rows=3333333 loops=3)
Buffers: shared hit=2176 read=71354
-> Parallel Hash (actual time=75.938..75.940 rows=3333 loops=3)
Buckets: 16384 Batches: 1 Memory Usage: 1056kB
Buffers: shared hit=318 read=7257
-> Hash Join (actual time=2.604..75.161 rows=3333 loops=3)
Hash Cond: (b.id = c.id)
Buffers: shared hit=318 read=7257
-> Parallel Seq Scan on t2 b (actual time=0.210..39.746 rows=333333 loops=3)
Buffers: shared hit=96 read=7257
-> Hash (actual time=2.346..2.347 rows=10000 loops=3)
Buckets: 16384 Batches: 1 Memory Usage: 714kB
Buffers: shared hit=222
-> Seq Scan on t3 c (actual time=0.009..0.946 rows=10000 loops=3)
Buffers: shared hit=222
-> Hash (actual time=0.218..0.219 rows=1000 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 67kB
Buffers: shared hit=24
-> Seq Scan on t4 d (actual time=0.010..0.104 rows=1000 loops=3)
Buffers: shared hit=24
Planning Time: 0.376 ms
Execution Time: 902.836 ms
The join order is: t4 -> t3 -> t2 -> t1
/*+ Leading(((a b) (c d))) HashJoin(c d) HashJoin(a b) HashJoin(a b c d) */
explain (analyze, buffers, costs off)
select count(a.col2), count(b.col2), count(c.col2), count(d.col2)
from t1 a, t2 b, t3 c, t4 d
where a.id = b.id
and b.id = c.id
and c.id = d.id;
Aggregate (actual time=1788.654..1848.705 rows=1 loops=1)
Buffers: shared hit=3131 read=78195, temp read=64423 written=64720
-> Gather (actual time=1188.720..1848.548 rows=1000 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=3131 read=78195, temp read=64423 written=64720
-> Hash Join (actual time=1171.587..1770.325 rows=333 loops=3)
Hash Cond: (a.id = c.id)
Buffers: shared hit=3130 read=78195, temp read=64423 written=64720
-> Parallel Hash Join (actual time=1168.894..1740.376 rows=333333 loops=3)
Hash Cond: (a.id = b.id)
Buffers: shared hit=2688 read=78195, temp read=64423 written=64720
-> Parallel Seq Scan on t1 a (actual time=0.051..525.572 rows=3333333 loops=3)
Buffers: shared hit=2400 read=71130
-> Parallel Hash (actual time=123.190..123.190 rows=333333 loops=3)
Buckets: 65536 Batches: 32 Memory Usage: 2496kB
Buffers: shared hit=288 read=7065, temp written=5884
-> Parallel Seq Scan on t2 b (actual time=1.484..61.028 rows=333333 loops=3)
Buffers: shared hit=288 read=7065