
Multi-join Queries via Hashing 2

Author : 모델광    Date : 2021-07-27 (Tue) 23:51    Views : 1644

In the previous post, titled 'Multi-join Queries via Hashing', I said that in a multi-join query via hashing a right-deep query plan that joins the small tables first can be more efficient than the corresponding bushy or left-deep query plan.

After reading that post you may raise an obvious question: what if the small tables don't have join columns between them, so we have to join a large table and a small table first? Which strategy should we take then, the right-deep query plan or the left-deep query plan?

In this note, I will give a clue on how to improve performance in a Data Warehouse system and provide some tips on how to carry out this kind of thought experiment.


The following is the query from the previous post followed by its execution plan.


/*+ leading((a (b (c d)))) HashJoin(c d) HashJoin(b c d) HashJoin(a b c d) */

explain (analyze, buffers, costs off)

select count(a.col2), count(b.col2), count(c.col2), count(d.col2)

  from t1 a, t2 b, t3 c, t4 d

 where a.id = b.id 

   and b.id = c.id

   and c.id = d.id;


 Aggregate (actual time=747.931..749.591 rows=1 loops=1)

   Buffers: shared hit=6587 read=74739

   ->  Gather (actual time=684.554..749.483 rows=1000 loops=1)

         Workers Planned: 2

         Workers Launched: 2

         Buffers: shared hit=6587 read=74739

         ->  Parallel Hash Join (actual time=681.129..744.130 rows=333 loops=3)  --3rd result set

               Hash Cond: (a.id = b.id)

               Buffers: shared hit=6586 read=74739

               ->  Parallel Seq Scan on t1 a (actual time=0.008..292.410 rows=3333333 loops=3)

                     Buffers: shared hit=3264 read=70266

               ->  Parallel Hash (actual time=65.596..65.600 rows=333 loops=3)

                     Buckets: 1024  Batches: 1  Memory Usage: 168kB

                     Buffers: shared hit=3126 read=4473

                     ->  Hash Join (actual time=23.401..65.419 rows=333 loops=3)   --2nd result set

                           Hash Cond: (b.id = c.id)

                           Buffers: shared hit=3126 read=4473

                           ->  Parallel Seq Scan on t2 b (actual time=0.038..28.868 rows=333333 loops=3)

                                 Buffers: shared hit=2880 read=4473

                           ->  Hash (actual time=3.715..3.717 rows=1000 loops=3)

                                 Buckets: 1024  Batches: 1  Memory Usage: 94kB

                                 Buffers: shared hit=246

                                 ->  Hash Join (actual time=0.454..3.418 rows=1000 loops=3)  --1st result set

                                       Hash Cond: (c.id = d.id)

                                       Buffers: shared hit=246

                                       ->  Seq Scan on t3 c (actual time=0.010..1.194 rows=10000 loops=3)

                                             Buffers: shared hit=222

                                       ->  Hash (actual time=0.423..0.424 rows=1000 loops=3)

                                             Buckets: 1024  Batches: 1  Memory Usage: 67kB

                                             Buffers: shared hit=24

                                             ->  Seq Scan on t4 d (actual time=0.021..0.198 rows=1000 loops=3)

                                                   Buffers: shared hit=24

 Planning Time: 0.516 ms

 Execution Time: 749.648 ms
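
A side note before going further: the leading(...) and HashJoin(...) comments in the queries throughout this post are pg_hint_plan hints; vanilla PostgreSQL simply ignores them. If you want to reproduce these plans, the following is a minimal sketch of how one might enable the extension in a session, assuming pg_hint_plan is already installed on the server.


-- Assumption: the pg_hint_plan extension is installed on this server.
-- Load it for the current session and make sure hints are honored.
load 'pg_hint_plan';
set pg_hint_plan.enable_hint = on;

-- Optional: report how each hint was parsed, which helps spot a
-- malformed hint (e.g. a missing parenthesis) that would otherwise
-- be silently ignored.
set pg_hint_plan.debug_print = on;
set client_min_messages = log;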


In this example t3 and t4 were small, so joining the two small tables first was a good idea. But what would we do if t3 and t4 could not be joined directly?

As a thought experiment, consider three small dimension tables and one big fact table in a DW system. If PostgreSQL were to create a tiny in-memory hash table from the first dimension table and probe it with the large fact table (following the pattern of the right-deep plan above, with the fact table in the role of t3), it would then have to build a very large in-memory hash table from that join result before probing it with the second dimension table, and as that second join took place it would generate yet another large result set that would become the next big in-memory hash table. We can infer that this right-deep strategy cannot perform well, because each successive build hash table keeps getting bigger.


To take the thought experiment further, let's look at the left-deep execution plan from the previous post.


/*+ leading((((a b) c) d)) HashJoin(a b) HashJoin(a b c) HashJoin(a b c d) */

explain (analyze, buffers, costs off)

select count(a.col2), count(b.col2), count(c.col2), count(d.col2)

  from t1 a, t2 b, t3 c, t4 d

 where a.id = b.id 

   and b.id = c.id

   and c.id = d.id;


Aggregate (actual time=2407.538..2474.817 rows=1 loops=1)

   Buffers: shared hit=3035 read=78291, temp read=60797 written=61132

   ->  Gather (actual time=1575.259..2474.578 rows=1000 loops=1)

         Workers Planned: 2

         Workers Launched: 2

         Buffers: shared hit=3035 read=78291, temp read=60797 written=61132

         ->  Hash Join (actual time=1567.938..2388.345 rows=333 loops=3)

               Hash Cond: (a.id = d.id)

               Buffers: shared hit=3034 read=78291, temp read=60797 written=61132

               ->  Hash Join (actual time=1567.176..2386.749 rows=3333 loops=3)

                     Hash Cond: (a.id = c.id)

                     Buffers: shared hit=2814 read=78291, temp read=60797 written=61132

                     ->  Parallel Hash Join (actual time=1561.972..2322.007 rows=333333 loops=3)

                           Hash Cond: (a.id = b.id)

                           Buffers: shared hit=2592 read=78291, temp read=60797 written=61132

                           ->  Parallel Seq Scan on t1 a (actual time=0.412..812.159 rows=3333333 loops=3)

                                 Buffers: shared hit=2304 read=71226

                           ->  Parallel Hash (actual time=156.858..156.859 rows=333333 loops=3)  -- hash table 3

                                 Buckets: 65536  Batches: 32  Memory Usage: 2528kB

                                 Buffers: shared hit=288 read=7065, temp written=5908

                                 ->  Parallel Seq Scan on t2 b (actual time=0.210..76.429 rows=333333 loops=3)

                                       Buffers: shared hit=288 read=7065

                     ->  Hash (actual time=5.127..5.127 rows=10000 loops=3)    -- hash table 2

                           Buckets: 16384  Batches: 1  Memory Usage: 714kB

                           Buffers: shared hit=222

                           ->  Seq Scan on t3 c (actual time=0.020..2.044 rows=10000 loops=3)

                                 Buffers: shared hit=222

               ->  Hash (actual time=0.507..0.507 rows=1000 loops=3)  -- hash table 1

                     Buckets: 1024  Batches: 1  Memory Usage: 67kB

                     Buffers: shared hit=24

                     ->  Seq Scan on t4 d (actual time=0.030..0.248 rows=1000 loops=3)

                           Buffers: shared hit=24

 Planning Time: 0.611 ms

 Execution Time: 2474.967 ms


Conversely, if PostgreSQL were to create in-memory hash tables from the three dimension tables and then start scanning the large fact table (following the pattern of the left-deep plan above, with t1 in the role of the fact table), probing each of the hash tables in turn, it could deliver the resulting rows very quickly without requiring additional memory for hash tables holding intermediate results.


Here is a script that turns the thought experiment into a concrete example. Be aware that some of the hints used below are deliberately silly.


drop table t1;

drop table t2;

drop table t3;

drop table t4;

create table t1

as

select i as id

       , i::text as col1

       , rpad(chr(64+mod(i,26))::text,100,'x') as col2

 from generate_series(1,200) a(i);

create table t2 as select * from t1;

create table t3 as select * from t1;

create table t4 

as 

select t1.id    as dim_key1

     , t2.id    as dim_key2

     , t3.id    as dim_key3

     , 'abcdefghijk' as dummy

     , rpad('x',100) as padding

from t1, t2, t3

;

select relname, reltuples, relpages, relpages*8000/1024 as size_kb

 from pg_class

where relname in ('t1','t2','t3','t4')

order by relname;


 relname | reltuples | relpages | size_kb
---------+-----------+----------+---------
 t1      |       200 |        4 |      31
 t2      |       200 |        4 |      31
 t3      |       200 |        4 |      31
 t4      |     8e+06 |   153847 | 1201929


All I have done is create three small "dimension" tables of 200 rows each and then create a "fact" table that is their Cartesian join, i.e. 200 x 200 x 200 = 8,000,000 rows.
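
One caveat about the script: it does not run ANALYZE, so the reltuples and relpages figures above rely on autovacuum having already processed the tables by the time pg_class is queried. If you reproduce this and see stale or empty statistics, the sketch below collects them explicitly.


-- Populate pg_class.reltuples/relpages (and the planner statistics)
-- right away instead of waiting for autovacuum.
analyze t1;
analyze t2;
analyze t3;
analyze t4;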


Then I ran a simple query to join the three dimension tables to the fact table.


/*+ leading((((d a) b) c)) HashJoin(d a) */                                              --left-deep tree

explain (analyze, buffers, costs off)

select count(a.col2), count(b.col2), count(c.col2), count(d.padding)

  from t1 a, t2 b, t3 c, t4 d

 where a.id = d.dim_key1

   and b.id = d.dim_key2

   and c.id = d.dim_key3;


Below is the execution plan.


Finalize Aggregate (actual time=3003.011..3004.424 rows=1 loops=1)

   Buffers: shared hit=2602 read=151447

   ->  Gather (actual time=3002.817..3004.416 rows=3 loops=1)

         Workers Planned: 2

         Workers Launched: 2

         Buffers: shared hit=2602 read=151447

         ->  Partial Aggregate (actual time=2998.429..2998.435 rows=1 loops=3)

               Buffers: shared hit=2602 read=151447

               ->  Hash Join (actual time=0.356..2699.961 rows=2666667 loops=3)

                     Hash Cond: (d.dim_key3 = c.id)

                     Buffers: shared hit=2602 read=151447

                     ->  Hash Join (actual time=0.170..2086.527 rows=2666667 loops=3)

                           Hash Cond: (d.dim_key2 = b.id)

                           Buffers: shared hit=2424 read=151447

                           ->  Hash Join (actual time=0.091..1486.358 rows=2666667 loops=3)

                                 Hash Cond: (d.dim_key1 = a.id)

                                 Buffers: shared hit=2412 read=151447

                                  ->  Parallel Seq Scan on t4 d (actual time=0.015..682.546 rows=2666667 loops=3)

                                       Buffers: shared hit=2400 read=151447

                                 ->  Hash (actual time=0.066..0.067 rows=200 loops=3)

                                       Buckets: 1024  Batches: 1  Memory Usage: 36kB

                                       Buffers: shared hit=12

                                       ->  Seq Scan on t1 a (actual time=0.007..0.031 rows=200 loops=3)

                                             Buffers: shared hit=12

                           ->  Hash (actual time=0.068..0.068 rows=200 loops=3)

                                 Buckets: 1024  Batches: 1  Memory Usage: 36kB

                                 Buffers: shared hit=12

                                 ->  Seq Scan on t2 b (actual time=0.007..0.031 rows=200 loops=3)

                                       Buffers: shared hit=12

                     ->  Hash (actual time=0.074..0.075 rows=200 loops=3)

                           Buckets: 1024  Batches: 1  Memory Usage: 36kB

                           Buffers: shared hit=12

                           ->  Seq Scan on t3 c (actual time=0.014..0.039 rows=200 loops=3)

                                 Buffers: shared hit=12

 Planning Time: 0.197 ms

 Execution Time: 3004.463 ms


This left-deep plan shows that it had to build three hash tables simultaneously, but the sizes are not that big (36kB + 36kB + 36kB). After building the hash tables, PostgreSQL scanned the large t4 in parallel and probed the hash table built from t1. With the rows resulting from the t4 and t1 join it probed the hash table built from t2, and so on.


Now let's run the same query with different hints which produce the right-deep plan.


/*+ leading((b (c (d a)))) HashJoin(d a) HashJoin(c d a) HashJoin(b c d a) */  --right-deep

explain

select count(a.col2), count(b.col2), count(c.col2), count(d.padding)

  from t1 a, t2 b, t3 c, t4 d

 where a.id = d.dim_key1

   and b.id = d.dim_key2

   and c.id = d.dim_key3;


The following is the execution plan of the right-deep tree.


 Aggregate  (cost=810135.00..810135.01 rows=1 width=32)

   ->  Hash Join  (cost=636395.25..730135.00 rows=8000000 width=404)

         Hash Cond: (b.id = d.dim_key2)

         ->  Seq Scan on t2 b  (cost=0.00..6.00 rows=200 width=105)

         ->  Hash  (cost=536395.25..536395.25 rows=8000000 width=307)

               ->  Hash Join  (cost=443855.50..536395.25 rows=8000000 width=307)

                     Hash Cond: (c.id = d.dim_key3)

                     ->  Seq Scan on t3 c  (cost=0.00..6.00 rows=200 width=105)

                     ->  Hash  (cost=343855.50..343855.50 rows=8000000 width=210)

                           ->  Hash Join  (cost=8.50..343855.50 rows=8000000 width=210)

                                 Hash Cond: (d.dim_key1 = a.id)

                                 ->  Seq Scan on t4 d  (cost=0.00..233847.00 rows=8000000 width=113)

                                 ->  Hash  (cost=6.00..6.00 rows=200 width=105)

                                       ->  Seq Scan on t1 a  (cost=0.00..6.00 rows=200 width=105)


I deliberately omitted the block I/O and execution time figures; on my desktop the query failed, producing the error message below.


ERROR:  could not write to file "pg_tblspc/16387/PG_13_202007201/pgsql_tmp/pgsql_tmp1435.1237": No space left on device


There must have been a lot of spills to disk: work_mem was too small to hold the intermediate hash tables. In the plan above PostgreSQL creates a tiny hash table from t1 and probes it with t4. The problem is that t4 (the fact table) is huge, and the output of joining t1 and t4 has to be used as the build input of the next hash join with t3, which requires a tremendous amount of work_mem. In this particular example the plan estimates 8,000,000 rows of width 210 for the t1-t4 join, so we can infer that the intermediate build input would be well over 1 GB, considering that t4 itself is about 1.2 GB. On my desktop the error message indicates that the hash table spilled to disk and then ran out of disk space as well.
Even if there had been enough disk space, it would have taken a long time to get the result because of all the extra disk I/O.
We can safely say that in this particular case the right-deep execution plan was a bad idea.
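
If you run into this situation on a real system, the following sketch shows one way to watch the spills before the disk fills up. The 100 MB threshold is an arbitrary example, not a recommendation, and log_temp_files can only be changed by a superuser.


-- Current per-operation memory budget for hash tables and sorts.
show work_mem;

-- Log every temporary file larger than ~100 MB (the value is in kB;
-- 0 would log every temp file). Superuser only.
set log_temp_files = 102400;

-- Cumulative temp-file usage per database since the last stats reset.
select datname, temp_files, temp_bytes
  from pg_stat_database
 where datname = current_database();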

Conclusion
If we do multi-table hash joins and the result set of an early join is too large, we should arrange the join order so that the large intermediate result set is used as the probe input, not the build input. In other words, we should keep the intermediate build input as small as possible.

   

postgresdba.com