Converting Oracle Procedure to PostgreSQL 2

Author : 모델광   Date : 2021-06-02 (Wed) 06:41   Views : 1856

Well, this is part of a "mini-series" describing tips you should be aware of when you convert Oracle procedures to PostgreSQL procedures.

I hope this mini-series offers a better learning experience for readers than they could get from a random collection of articles.


The following is stripped-down code that a developer converted from an Oracle procedure.


DO $$
DECLARE
  l_cnt int := 0;
  c1 RECORD;
BEGIN
  --This is the code that checks if at least one record is present in the t1 table before processing
  --the rest of the statements in the procedure. I will call this code "check for existence sql" for the
  --rest of this note.
  SELECT COUNT(*)
    INTO l_cnt
   WHERE EXISTS (select 1 from t1);

  IF l_cnt = 0 THEN NULL;
  ELSE
    FOR c1 IN (
      --I will call this code block the "EXCEPT query" for the rest of this note
      SELECT n1, n2, v2 FROM t1
      EXCEPT
      SELECT n1, n2, v2 FROM t2
    )
    LOOP
      --This is the pseudo code that should be executed with the extracted values of (n1, n2, v2)
      raise notice '%,%,%', c1.n1, c1.n2, c1.v2;
    END LOOP;
  END IF;
END;
$$;


The writer of this code checks for the existence of rows in the t1 table before running the EXCEPT query. What happens if we run the query without the existence check when there are no rows in t1? The writer must have assumed that the query would scan the t2 table whether or not t1 contains any rows.


Below are the scripts to check the block I/Os and the elapsed time of the query.


drop table if exists t1 cascade;
drop table if exists t2 cascade;

create table t1
as
select trunc((i-1)/15) n1,
       trunc((i-1)/15) n2,
       rpad(i::text,180)::text v2
  from generate_series(1,30000) a(i);

create table t2 as select * from t1;

--I made the t1 table empty.
delete from t1;
vacuum t1;
vacuum t2;


Below is the SQL statement I ran. Note that t1 is empty now.

select * from t1
except
select * from t2;


 HashSetOp Except (actual time=11.997..12.001 rows=0 loops=1)
   Buffers: shared hit=858
   ->  Append (actual time=0.013..7.591 rows=30000 loops=1)
         Buffers: shared hit=858
         ->  Subquery Scan on "*SELECT* 1" (actual time=0.003..0.003 rows=0 loops=1)
               ->  Seq Scan on t1 (actual time=0.002..0.002 rows=0 loops=1)
         ->  Subquery Scan on "*SELECT* 2" (actual time=0.009..5.996 rows=30000 loops=1)
               Buffers: shared hit=858
               ->  Seq Scan on t2 (actual time=0.007..1.871 rows=30000 loops=1)
                     Buffers: shared hit=858
 Planning:
   Buffers: shared hit=40
 Planning Time: 0.309 ms
 Execution Time: 12.102 ms


Even though there are no rows in t1, PostgreSQL did a Seq Scan on t2, which is inefficient. So it was reasonable for the developer to check for the existence of rows in t1: if t1 is empty, the EXCEPT query is not run at all.
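Conceptually, a HashSetOp-style EXCEPT appends both inputs and hashes every row before emitting anything, which is why t2 gets scanned even when t1 is empty. Below is a rough Python model of that behavior (my own illustration, not the actual C implementation in nodeSetOp.c):

```python
def hashsetop_except(left_rows, right_rows):
    # Model of a HashSetOp-style EXCEPT: every row from BOTH inputs is
    # hashed into one table, so the second input is read even when the
    # first input is empty.
    counts = {}
    for row in left_rows:
        l, r = counts.get(row, (0, 0))
        counts[row] = (l + 1, r)
    for row in right_rows:
        l, r = counts.get(row, (0, 0))
        counts[row] = (l, r + 1)
    # EXCEPT (without ALL) emits each distinct row that appears in the
    # left input but not in the right input.
    return [row for row, (l, r) in counts.items() if l > 0 and r == 0]

print(hashsetop_except([(1, 'a'), (2, 'b'), (2, 'b')], [(2, 'b')]))  # -> [(1, 'a')]
```

Note that even when `left_rows` is empty, the loop over `right_rows` still runs to completion, mirroring the Seq Scan on t2 in the plan above.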

However, what happens when there are rows in t1? Then you are making two DBMS calls: one to check for existence, and one to run the EXCEPT query. And there is a slight chance that the check-for-existence query leads to an inconsistent result. The existence check may find a row, but by the time the EXCEPT query runs, another transaction may have deleted the row you found on the first pass. We cannot guarantee that nothing happens between the two DBMS calls.


If you are sure that the first subquery in the EXCEPT query is going to be cheap and you're worried that the second subquery is expensive, you could do the following.


DO $$
DECLARE
  c1 RECORD;
BEGIN
  FOR c1 IN
    --I will call this code block the "EXCEPT query" for the rest of this note
    SELECT n1, n2, v2
      FROM (
            SELECT n1, n2, v2 FROM t1
            EXCEPT
            SELECT n1, n2, v2 FROM t2) V
     WHERE EXISTS (SELECT 1 FROM t1)
  LOOP
    --This is the pseudo code that should be run with the extracted values of n1, n2 and v2.
    raise notice '%,%,%', c1.n1, c1.n2, c1.v2;
  END LOOP;
END;
$$;


I made just one DBMS call instead of checking for existence and then running the EXCEPT query. Below is the execution plan when there are no rows in the t1 table.


 Result (actual time=0.006..0.007 rows=0 loops=1)
   One-Time Filter: $0
   InitPlan 1 (returns $0)
     ->  Seq Scan on t1 t1_1 (actual time=0.003..0.004 rows=0 loops=1)
   ->  Subquery Scan on v (never executed)
         ->  HashSetOp Except (never executed)
               ->  Append (never executed)
                     ->  Subquery Scan on "*SELECT* 1" (never executed)
                           ->  Seq Scan on t1 (never executed)
                     ->  Subquery Scan on "*SELECT* 2" (never executed)
                           ->  Seq Scan on t2 (never executed)
 Planning Time: 0.243 ms
 Execution Time: 0.076 ms


As you can see, PostgreSQL executed neither the upper nor the lower part of the EXCEPT query. Because we vacuumed t1 after deleting all its rows, the empty table was truncated to zero pages, so the WHERE EXISTS subquery incurred no block I/O. If we don't vacuum t1, PostgreSQL has to scan the empty t1 table past its dead tuples, but the amount of block I/O will not be that big.


Below is the execution plan when there are some rows in the t1 table.


--PLAN 1  -- I call this plan "PLAN 1" in order to compare it with another plan.
 Result (actual time=30.522..31.476 rows=11 loops=1)
   One-Time Filter: $0
   Buffers: shared hit=1717
   InitPlan 1 (returns $0)
     ->  Seq Scan on t1 t1_1 (actual time=0.013..0.013 rows=1 loops=1)
           Buffers: shared hit=1
   ->  Subquery Scan on v (actual time=30.506..31.457 rows=11 loops=1)
         Buffers: shared hit=1716
         ->  HashSetOp Except (actual time=30.504..31.453 rows=11 loops=1)
               Buffers: shared hit=1716
               ->  Append (actual time=0.004..12.660 rows=60000 loops=1)
                     Buffers: shared hit=1716
                     ->  Subquery Scan on "*SELECT* 1" (actual time=0.003..5.351 rows=30000 loops=1)
                           Buffers: shared hit=858
                           ->  Seq Scan on t1 (actual time=0.002..2.103 rows=30000 loops=1)
                                 Buffers: shared hit=858
                     ->  Subquery Scan on "*SELECT* 2" (actual time=0.006..4.322 rows=30000 loops=1)
                           Buffers: shared hit=858
                           ->  Seq Scan on t2 (actual time=0.005..1.596 rows=30000 loops=1)
                                 Buffers: shared hit=858
 Planning:
   Buffers: shared hit=11
 Planning Time: 0.162 ms
 Execution Time: 32.556 ms


We can see that the WHERE EXISTS (SELECT 1 FROM t1) clause was executed first. As soon as a row was found, hitting just one block, the rest of the EXCEPT query was executed. So when there are rows in the t1 table, only one extra block is visited to determine whether the rest of the query should run. The additional elapsed time is "only 0.013 ms" (depending on your viewpoint, I admit it can be "a huge 0.013 ms"). One more advantage of this trick is that we don't have to worry about data inconsistency.


Conclusion

When you write a procedure, don't make PostgreSQL do unnecessary work.

Try to reduce the number of DBMS calls.


Footnote

You can see the source code of HashSetOp at the link below.


https://github.com/postgres/postgres/blob/master/src/backend/executor/nodeSetOp.c


Addendum 1

When I have to extract rows that don't exist in another table, I would rather write a query like this:


select n1, n2, v2
  from t1
 where not exists (select 1
                     from t2
                    where t1.n1 = t2.n1
                      and t1.n2 = t2.n2
                      and t1.v2 = t2.v2);


Below is the execution plan produced by the optimizer when there are some rows in the t1 table.


--PLAN 2
 Hash Anti Join (actual time=12.733..35.390 rows=111 loops=1)
   Hash Cond: ((t1.n1 = t2.n1) AND (t1.n2 = t2.n2) AND (t1.v2 = t2.v2))
   Buffers: shared hit=1717, temp read=800 written=800
   ->  Seq Scan on t1 (actual time=0.008..1.985 rows=30000 loops=1)
         Buffers: shared hit=858
   ->  Hash (actual time=12.455..12.455 rows=30000 loops=1)
         Buckets: 32768  Batches: 2  Memory Usage: 3679kB
         Buffers: shared hit=858, temp written=400
         ->  Seq Scan on t2 (actual time=0.003..2.125 rows=30000 loops=1)
               Buffers: shared hit=858
 Planning:
   Buffers: shared hit=33
 Planning Time: 0.484 ms
 Execution Time: 35.457 ms


Below is the execution plan when there are no rows in the t1 table.


 Hash Anti Join (actual time=0.012..0.014 rows=0 loops=1)
   Hash Cond: ((t1.n1 = t2.n1) AND (t1.n2 = t2.n2) AND (t1.v2 = t2.v2))
   Buffers: shared hit=1
   ->  Seq Scan on t1 (actual time=0.011..0.012 rows=0 loops=1)
         Buffers: shared hit=1
   ->  Hash (never executed)
         ->  Seq Scan on t2 (never executed)
 Planning Time: 0.216 ms
 Execution Time: 0.039 ms


Wow! Aren't you surprised by this execution plan? Before accessing table t2 to build the hash table, PostgreSQL read table t1. Because no rows were found in the probe input table, PostgreSQL didn't scan the t2 table at all. It's like artificial intelligence kicking in.


The following is what I think happens in an execution plan that uses the Hash Join method.

1) The optimizer decides to use the Hash Join method, with t2 as the build input.

2) The executor scans t2 and returns all the relevant tuples, building an in-memory hash table by applying a hash function to the join columns of each returned row.

3) The executor then starts a scan of t1. For each row returned, it applies the same hash function to the join columns and probes the hash table for a matching row.


Unlike what I described about the Hash Join, the executor peeked at table t1 first in the plan above. I need to do some more investigation into the Hash Join.
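This peeking behavior can be modeled in a few lines of Python. The sketch below is my own illustration, not PostgreSQL's actual C code: the point is that the first probe (outer) row is fetched before the hash table is built, so an empty probe side means the build side is never touched.

```python
def hash_anti_join(probe_rows, build_rows):
    """Toy model of a hash anti join with an empty-probe shortcut."""
    probe_iter = iter(probe_rows)
    try:
        first = next(probe_iter)   # peek at the probe side (t1) first
    except StopIteration:
        return []                  # probe side empty: build side (t2) is "never executed"
    hash_table = set(build_rows)   # only now is the build input (t2) scanned and hashed
    # anti join: emit probe rows that have no match in the hash table
    return [row for row in (first, *probe_iter) if row not in hash_table]
```

If `build_rows` is a lazy iterator, you can verify that it is never consumed when the probe side is empty, just as the plan above shows "never executed" for the Hash node.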


Addendum 2

When I compared PLAN 1 with PLAN 2, I couldn't understand why PLAN 1 doesn't use temp files while PLAN 2 does. Both HashSetOp and Hash Anti Join must build a hash table to do their job. So I posted a question on dba.stackexchange.com, and a person named jjanes gave me a clue as to why PLAN 2 shows that it overruns work_mem while PLAN 1 doesn't.

Here is the URL of my question followed by his or her answer.

https://dba.stackexchange.com/questions/292287/in-postgresql-what-area-in-memory-does-hashsepop-use-work-mem-or-shared-buffer


work_mem is not an "area", it is just a value.

HashSetOp is pretty poor at memory estimation and usage. It has no provision for spilling to disk. If it ends up using more memory than work_mem is set to, then it just goes ahead and uses more memory than it was supposed to. There is no external enforcement mechanism (until the kernel steps in), and in the case of HashSetOp no internal one either.

If the planner thinks ahead of time that HashSetOp would overrun work_mem, then it just doesn't get chosen, and you will probably get some sort-based method instead. But again, the estimation is poor, it will often use several times more memory than the planner thought it would, and will happily blow way past the setting of work_mem.

HashSetOp could really use some improvements, but it is so seldom used that no one is very excited about doing the work.
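For readers who want to experiment, work_mem is an ordinary session setting, so you can raise it and re-run the query from Addendum 1 to see whether the temp files in PLAN 2 disappear. These are standard PostgreSQL commands; the 64MB value is just an example:

```sql
SHOW work_mem;          -- 4MB by default
SET work_mem = '64MB';  -- session-level change

EXPLAIN (ANALYZE, BUFFERS)
SELECT n1, n2, v2
  FROM t1
 WHERE NOT EXISTS (SELECT 1
                     FROM t2
                    WHERE t1.n1 = t2.n1
                      AND t1.n2 = t2.n2
                      AND t1.v2 = t2.v2);
```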



postgresdba.com