PostgreSQL 은 특별합니다.
단순한 데이터베이스가 아니라, 하나의 어플리케이션 플랫폼이기 때문입니다.

공지사항

뉴스 및 기사

	총 게시물 162건, 최근 0 건

Intimate Secret of Hash Join

글쓴이 : 모델광 날짜 : 2021-06-05 (토) 16:15 조회 : 1901

In the Addendum1 of the previous note I said that I needed to do some in-depth investigation into a hash join. Below are my findings.

The following is an excerpt from the PostgreSQL manual.

https://www.postgresql.org/docs/9.4/planner-optimizer.html

hash join: the right relation is first scanned and loaded into a hash table, using its join attributes as hash keys. Next the left relation is scanned and the appropriate values of every row found are used as hash keys to locate the matching rows in the table.

My understanding of the passage above is this:

When there are two tables to join, the optimizer builds a hash table using the build input table, and then probe the other table to locate the matching rows. So the optimizer visits each table just once.

But with some tests I came to a conclusion that the statement in the manual regarding the hash join missed one important point. After the planner decides which one of the two tables should be the build input table, the optimizer scans the probe table first prior to building an in-memory hash table from a build table in the work_mem area.

I supply test scripts for the demonstration and some explanations below.

drop table if exists t1 cascade;

drop table if exists t2 cascade;

create table t1

select trunc((i-1)/15) n1,

trunc((i-1)/15) n2,

rpad(i::text,180)::text v2

from generate_series(1,30000) a(i);

create table t2 as select * from t1;

--I made table t1 empty.

delete from t1;

--I inserted two rows in order to make dead tuples.

insert into t1 select * from t2 limit 2;

vacuum t1;

vacuum t2;

delete from t1;

analyze t1;

Below is a query we are interested in followed by its execution plan.

select *

from t1, t2

where t1.n1 = t2.n1

and t1.n2 = t2.n2

and t1.v2 = t2.v2;

Hash Join (actual time=0.118..0.127 rows=0 loops=1)

Hash Cond: ((t2.n1 = t1.n1) AND (t2.n2 = t1.n2) AND (t2.v2 = t1.v2))

Buffers: shared hit=8

-> Seq Scan on t2 (actual time=0.011..0.013 rows=1 loops=1)

Buffers: shared hit=1

-> Hash (actual time=0.012..0.015 rows=0 loops=1)

Buckets: 1024 Batches: 1 Memory Usage: 8kB

Buffers: shared hit=1

-> Seq Scan on t1 (actual time=0.011..0.012 rows=0 loops=1)

Buffers: shared hit=1

Planning:

Buffers: shared hit=100

Planning Time: 0.892 ms

Execution Time: 0.220 ms

In the execution plan above we have to pay attention to the fact that PostgreSQL visited one block in t2 table before accessing t1 and building a hash table. We know that t2 has 30000 rows, but the optimizer scanned just 1 row. If we stick to the statement in the manual, the optimizer should visit only t1. Because there are no rows in t1 and the optimizer cannot build a hash table, the optimizer doesn't have to try to find rows in t2 at all. But unlike what the manual says, the optimizer scanned t2 trying to find just 1 row. What does this mean? The mechanism in the Hash Join is "after the planner decides which table to be the build table, the first thing the optimizer does is not scanning the build input table, but trying to find a single row in the probe table. When it finds one row in the probe table, then the optimizer begins building a hash table with the build table and then it does scan the probe table to locate the matching rows. In the plan above, the optimizer could find a row from t2 after hitting just 1 block, so it went on scanning t1 trying to build a hash table. Consequently the optimizer couldn't find any rows in t1, so it stopped scanning t2 which had 30000 rows.

Here you might come up with an automatic question, "what will the optimizer do if it cannot find a row in t2? Will it go on scanning the build table and building a hash table?" You might assume that the optimizer will stop doing anything when it cannot find a row in t2. So let's do the test.

delete from t2;

Now both t1 and t2 are empty.

I executed the SQL statement again. Here is the resulting execution plan.

Hash Join (actual time=5.058..5.059 rows=0 loops=1)

Hash Cond: ((t2.n1 = t1.n1) AND (t2.n2 = t1.n2) AND (t2.v2 = t1.v2))

Buffers: shared hit=858

-> Seq Scan on t2 (actual time=5.057..5.057 rows=0 loops=1)

Buffers: shared hit=858

-> Hash (never executed)

-> Seq Scan on t1 (never executed)

Planning Time: 0.110 ms

Execution Time: 5.073 ms

As you might guess, when the optimizer couldn't find a row from t2, it never scanned the build table. Even though there are no rows in table t2, the optimizer hits 858 blocks which contain dead tuples looking for a single row.

Conclusion

When the hash join kicks in, PostgreSQL scans the probe table first looking for just one row. When it fails to take a row from the probe table, it stops doing anything such as accessing the build table.

PostgresDBA

2021-06-08 (화) 09:04

학구적이시네요!

댓글주소

E-MAIL: root@postgresdba.com