설문조사
PostgreSQL/PPAS 관련 듣고 싶은 교육은


총 게시물 162건, 최근 0 건
   

Intimate Secret of Hash Join

글쓴이 : 모델광 날짜 : 2021-06-05 (토) 16:15 조회 : 1901

In the Addendum1 of the previous note I said that I needed to do some in-depth investigation into a hash join. Below are my findings.


The following is an excerpt from the PostgreSQL manual.


https://www.postgresql.org/docs/9.4/planner-optimizer.html

hash join: the right relation is first scanned and loaded into a hash table, using its join attributes as hash keys. Next the left relation is scanned and the appropriate values of every row found are used as hash keys to locate the matching rows in the table.


My understanding of the passage above is this:

When there are two tables to join, the optimizer builds a hash table using the build input table, and then probe the other table to locate the matching rows. So the optimizer visits each table just once.

But with some tests I came to a conclusion that the statement in the manual regarding the hash join missed one important point. After the planner decides which one of the two tables should be the build input table, the optimizer scans the probe table first prior to building an in-memory hash table from a build table in the work_mem area.


I supply test scripts for the demonstration and some explanations below.


drop table if exists t1 cascade;

drop table if exists t2 cascade;

create table t1

as

select trunc((i-1)/15) n1,

       trunc((i-1)/15) n2,

       rpad(i::text,180)::text    v2

  from generate_series(1,30000) a(i); 

create table t2 as select * from t1;


--I made table t1 empty.

delete from t1;

--I inserted two rows in order to make dead tuples.

insert into t1 select * from t2 limit 2;

vacuum t1;

vacuum t2;

delete from t1;

analyze t1;


Below is a query we are interested in followed by its execution plan.


select *

  from t1, t2

 where t1.n1 = t2.n1

   and t1.n2 = t2.n2

   and t1.v2 = t2.v2;


 Hash Join (actual time=0.118..0.127 rows=0 loops=1)

   Hash Cond: ((t2.n1 = t1.n1) AND (t2.n2 = t1.n2) AND (t2.v2 = t1.v2))

   Buffers: shared hit=8

   ->  Seq Scan on t2 (actual time=0.011..0.013 rows=1 loops=1)

         Buffers: shared hit=1

   ->  Hash (actual time=0.012..0.015 rows=0 loops=1)

         Buckets: 1024  Batches: 1  Memory Usage: 8kB

         Buffers: shared hit=1

         ->  Seq Scan on t1 (actual time=0.011..0.012 rows=0 loops=1)

               Buffers: shared hit=1

 Planning:

   Buffers: shared hit=100

 Planning Time: 0.892 ms

 Execution Time: 0.220 ms


In the execution plan above we have to pay attention to the fact that PostgreSQL visited one block in t2 table before  accessing t1 and building a hash table. We know that t2 has 30000 rows, but the optimizer scanned just 1 row. If we stick to the statement in the manual, the optimizer should visit only t1. Because there are no rows in t1 and the optimizer cannot build a hash table, the optimizer doesn't have to try to find rows in t2 at all. But unlike what the manual says, the optimizer scanned t2 trying to find just 1 row. What does this mean? The mechanism in the Hash Join is "after the planner decides which table to be the build table, the first thing the optimizer does is not scanning the build input table, but trying to find a single row in the probe table. When it finds one row in the probe table, then the optimizer begins building a hash table with the build table and then it does scan the probe table to locate the matching rows. In the plan above, the optimizer could find a row from t2 after hitting just 1 block, so it went on scanning t1 trying to build a hash table. Consequently the optimizer couldn't find any rows in t1, so it stopped scanning t2 which had 30000 rows.

Here you might come up with an automatic question, "what will the optimizer do if it cannot find a row in t2? Will it go on scanning the build table and building a hash table?" You might assume that the optimizer will stop doing anything when it cannot find a row in t2. So let's do the test.


delete from t2;


Now both t1 and t2 are empty.

I executed the SQL statement again. Here is the resulting execution plan.


 Hash Join (actual time=5.058..5.059 rows=0 loops=1)

   Hash Cond: ((t2.n1 = t1.n1) AND (t2.n2 = t1.n2) AND (t2.v2 = t1.v2))

   Buffers: shared hit=858

   ->  Seq Scan on t2 (actual time=5.057..5.057 rows=0 loops=1)

         Buffers: shared hit=858

   ->  Hash (never executed)

         ->  Seq Scan on t1 (never executed)

 Planning Time: 0.110 ms

 Execution Time: 5.073 ms


As you might guess, when the optimizer couldn't find a row from t2, it never scanned the build table. Even though there are no rows in table t2, the optimizer hits 858 blocks which contain dead tuples looking for a single row. 


Conclusion

When the hash join kicks in, PostgreSQL scans the probe table first looking for just one row. When it fails to take a row from the probe table, it stops doing anything such as accessing the build table.


PostgresDBA 2021-06-08 (화) 09:04
학구적이시네요!
댓글주소
   

postgresdba.com