설문조사
PostgreSQL/PPAS 관련 듣고 싶은 교육은


총 게시물 162건, 최근 0 건
   

Converting Oracle Procedure to PostgreSQL

글쓴이 : 모델광 날짜 : 2021-05-26 (수) 06:59 조회 : 1939
Warren Buffet's golden rule is
"Rule No.1 is never lose money.
Rule No.2 is never forget Rule No.1."

PostgreSQL performance tuners' golden rule is
"Rule No.1 is never let PostgreSQL do the unnecessary job.
Rule No.2 is never forget Rule No.1."

This is a short note describing a tip you should be aware of when you convert Oracle procedures to PostgreSQL procedures.


When I was in a project where we were changing DBMS from Oracle to PostgreSQL, a procedure developer wrote a procedure which I thought was somewhat strange.


Below is the code a developer converted from an Oracle procedure.


DO $$

DECLARE

  v_ord_prod_id varchar(2) := 'C';

  l_cnt int:= 0;

  c1 record;

BEGIN

  --This is the code that checks if at least one record is present in the ORDS table before processing  

  --the rest of the statements in the procedure. I will call this code "check for existence sql" for the

  -- rest of this note.

 SELECT COALESCE(COUNT(*),0)

   INTO l_cnt

  WHERE EXISTS (select ord_id, ord_prod_id from ords where ord_prod_id = v_ord_prod_id);

 IF l_cnt = 0 THEN NULL;

 ELSE

   FOR c1 IN (

             select ord_id, ord_prod_id from ords where ord_prod_id = v_ord_prod_id

             ) LOOP

   --This is the pseudo code that should be executed with extracted values of (ord_id, prod_id)

   raise notice '%,%', c1.ord_id, c1.ord_prod_id;

   END LOOP;

 END IF;

END;

$$


We can infer that the developer must have thought that the "check for existence sql" would avoid executing redundant FOR ~ IN clause when it finds that there is no data that matches the the condition (where ord_prod_id = v_ord_prod_id).

But, to me  the red-colored code was just redundant because PostgreSQL can produce an efficient execution plan even when there is no row satisfying the predicate if there is an index on the column. The check for existence code was just consuming DBMS CPU load and block I/Os for nothing. And there are slim chances that the check-for-existence-sql will result in an inconsistent database state. When you execute the check-for-existence-sql you may find a row but when you run the FOR ~ IN clause you may find that there are no rows to process because something modified the row you found on the first pass. Will this make the program crash? We cannot be sure what will happen.


So I revised the code above like this:


DO $$

DECLARE

   v_ord_prod_id varchar(2) := 'C';

  c1 record;

BEGIN

   FOR c1 IN (

             select ord_id, ord_prod_id from ords where ord_prod_id = v_ord_prod_id

             ) LOOP

    --This is the pseudo code that should be executed with the value of (ord_id, prod_id)

   raise notice '%,%', c1.ord_id, c1.ord_prod_id;  

   END LOOP;

END;

$$

LANGUAGE 'plpgsql';


I just eliminated the check-for-existence part of the procedure. I thought that if there were not any rows from the ORDS table filtered by the where clause, the RAISE part would not be executed.  When I handed over this code to the developer, the developer refused to ship this rewritten code. The developer said, "The procedure has long been working in ORACLE, there must be some reason we don't know why the Oracle procedure developer had to put in the check-for-existence code."

So I had to investigate the Oracle procedure to find out why the check-for-existence query was put in.  The following is the script I used to explain why the check-for-existence query is redundant in PostgreSQL while it may be necessary in ORACLE from the viewpoint of elapsed time.


DROP TABLE ords;

CREATE TABLE ords (

ORD_ID INT NOT NULL,

ORD_PROD_ID VARCHAR(2) NOT NULL,

ORD_STATUS CHAR(1) NOT NULL,

ETC_CONTENT VARCHAR(100));


--drop index ords_X02;

create index ords_x02 on ords(ord_prod_id, ord_status);


INSERT INTO ords

SELECT i

      ,case when mod(i,3) = 0 then 'A'

            when mod(i,3) = 1 then 'B'

            else 'C' end

      ,chr(64+case when i <= 10 then i else 11 end)

      ,rpad('x',100,'x')

  FROM generate_series(1,10000000) a(i);

analyze ORDS;

Let's take a look at the statistics of the ORDS table.

select tablename, attname, most_common_vals, most_common_freqs
  from pg_stats where tablename='ords';
tablename   attname,   most_common_vals,   most_common_freqs
ords       ord_prod_id      {C,B,A}             {0.33556667,0.33223334,0.3322}

The column ord_prod_id has only 3 distinct values and they are C,B and A.
Below is the query we used in the C1 cursor clause followed by its execution plan in PostgreSQL. Let's say that we provided the value 'C' for the v_ord_prod_id variable.

select ord_id, ord_prod_id from ords where ord_prod_id = 'C';

 Seq Scan on ords (actual time=0.348..1801.194 rows=3333333 loops=1)
   Filter: ((ord_prod_id)::text = 'C'::text)
   Rows Removed by Filter: 6666667
   Buffers: shared hit=4098 read=168316
 Planning:
   Buffers: shared hit=20 read=1
 Planning Time: 0.244 ms
 Execution Time: 1896.988 ms


Even though we have an index on the ORD_PROD_ID column, the optimizer didn't access the index which is reasonable because the selectivity of the column is not good. What will happen if  the optimizer produced an inefficient plan for some reason? To check that, I ran the query below.


/+ IndexScan(ords ords_x02) */  -- I force the optimizer to access the index.

select ord_id, ord_prod_id from ords where ord_prod_id = 'C';


 Index Scan using ords_x02 on ords (actual time=3.487..2691.037 rows=3333333 loops=1)

   Index Cond: ((ord_prod_id)::text = 'C'::text)

   Buffers: shared hit=1 read=175222

 Settings: search_path = 'portal'

 Planning:

   Buffers: shared hit=157

 Planning Time: 1.316 ms

 Execution Time: 2798.704 ms


We can see that PostgreSQL's decision to do Seq Scan was right. The block I/Os and the elapsed time increased when it did an Index Scan.


What will happen if we assign a value which doesn't exist in the ORD_PROD_ID column to the variable?

Let's assume that we assign the value 'K' to the v_ord_prod_id variable.


select ord_id, ord_prod_id from ords where ord_prod_id = 'K';


Index Scan using ords_x02 on ords  (cost=0.43..8.02 rows=1 width=6) 

                                                          (actual time=0.027..0.027 rows=0  loops=1)

   Index Cond: ((ord_prod_id)::text = 'K'::text)

   Buffers: shared hit=3

 Planning Time: 0.061 ms

 Execution Time: 0.034 ms


Now PostgreSQL estimated it will just extract 1 row and accesses the index on the column ORD_PROD_ID. This efficient plan was produced because PostgreSQL already knew that it had only 'A','B',and 'C' in the ORD_PROD_ID column. So if we assign a value which doesn't exist in the ORD_PROD_ID column to the variable, PostgreSQL will run the sql in the C1 cursor clause with short elapsed time (0.034 ms) and then it will not execute the LOOP clause.


But what will happen in Oracle if we assign a value which doesn't exist in the ORD_PROD_ID column to the revised procedural code?

Here we have to know the architectural difference between ORACLE and PostgreSQL.

In PostgreSQL, bind-peeking is mandatory. In the procedure when we execute this sql in the C1 cursor:


select ord_id, ord_prod_id from ords where ord_prod_id = v_ord_prod_id


PostgreSQL peeks the value of v_ord_prod_id variable and with the value of v_ord_prod_id (Say, 'K') it can refer to its frequency histogram and estimate that the ORD_PROD_ID column has only 1 row with the value of 'K'. So PostgreSQL could produce an efficient execution plan accessing the index. So the total elapsed time of the procedure is very short.

On the other hand, In ORACLE it doesn't work that way. Surely Oracle also has the capability of peeking the value of v_ord_prod_id. But most ORACLE DBAs disable the capability for the sake of stable execution plans. They set the ORACLE parameter '_optim_peek_user_binds' to 'FALSE'. So if you execute the query in the C1 cursor assigning the value 'K' to the variable, ORACLE doesn't do the index scan. The Oracle optimizer does a full table scan (2798 ms) while the PostgreSQL optimizer does the Index Scan(0.034 ms)And it will take longer to execute the procedure in Oracle.


That's why in ORACLE procedure, the developer checked for the existence of data before going into the FOR ~ IN clause. In ORACLE the check-for-existence query accesses the index on the column ORD_PROD_ID because it knows that it can stop reading the index when 1 row is extracted. So when Oracle is assigned the value of v_ord_prod_id variable which doesn't exist, it can skip the FOR ~ IN clause with good elapsed time.


Conclusion

When you convert ORACLE procedures to PostgreSQL procedures, be cautious of the logic and the architectural difference. If possible, let PostgreSQL do less job.

 

Addendum 1

What should we do if we don't have an index on the column ORD_PROD_ID? In that case, the performance degradation is inevitable whether you put in the check-for-existence query or not.


Addendum 2

If we must check if at least one record is present in the table, I would rather write the query like this:

The query below is more intelligible than the query above. The query above is a kind of Oracle dialect. Oracle doesn't recognize the query below which is of ANSI standard format.

SELECT COUNT(*)

  FROM (

            SELECT ord_prod_id

              FROM ords

            WHERE ord_prod_id = v_ord_prod_id

            FETCH FIRST 1 ROWS ONLY

            ) A

   INTO l_cnt;


Below is the execution plan produced by the optimizer when we assign the value 'K' to the v_ord_prod_id variable.


 Aggregate (actual time=0.022..0.022 rows=1 loops=1)

   Buffers: shared hit=3

   ->  Limit (actual time=0.019..0.019 rows=0 loops=1)

         Buffers: shared hit=3

         ->  Index Only Scan using ords_x02 on ords (actual time=0.018..0.018 rows=0 loops=1)

               Index Cond: (ord_prod_id = 'K'::text)

               Heap Fetches: 0

               Buffers: shared hit=3

 Planning Time: 0.201 ms

 Execution Time: 0.090 ms


Below is the execution plan produced when we assign the value 'C' to the variable.

Even though it did a full table scan, it stopped scanning the ORDS table as soon as it got one row.


 Aggregate (actual time=0.027..0.027 rows=1 loops=1)

   Buffers: shared hit=1

   ->  Limit (actual time=0.022..0.022 rows=1 loops=1)

         Buffers: shared hit=1

         ->  Seq Scan on ords (actual time=0.020..0.020 rows=1 loops=1)

               Filter: ((ord_prod_id)::text = 'C'::text)

               Rows Removed by Filter: 1

               Buffers: shared hit=1

 Planning:

   Buffers: shared hit=20

 Planning Time: 0.470 ms

 Execution Time: 0.174 ms


   

postgresdba.com