
Function performance and volatility categories

Author: 모델광    Date: 2022-04-30 (Sat) 20:56    Views: 384

This is a short note on the performance of user-defined functions (UDFs) and their volatility classification.

I have been working with PG for 3 years. As time goes on, I realize how foolish I was.

The following is one of the things I missed in the projects I worked on in the past.

Usually I advise developers not to create UDFs, because UDFs can do a lot of harm to the performance of a database system. When I am forced to allow them, I only teach how to reduce the number of UDF calls, without giving any guidance on how to write the UDFs themselves.
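For context, the rewrite I usually suggest is to replace a per-row UDF call with a join. This is only a sketch; the table and function names here are hypothetical.

```sql
-- Instead of calling a lookup UDF once per row:
--   SELECT o.ord_no, f_lookup_name(o.prod_id) FROM orders o;
-- join to the lookup table directly, so the executor resolves
-- the name with a single hash or nested-loop join:
SELECT o.ord_no, p.prod_nm
  FROM orders o
  JOIN prod   p ON p.prod_id = o.prod_id;
```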

Recently I read the PG documentation and found out that a UDF's performance can differ depending on its volatility category.

Below is the URL of the manual.

https://www.postgresql.org/docs/9.5/xfunc-volatility.html


I did some performance experiments on the volatility categories.

Here is the sample code, run on PostgreSQL 13.


CREATE TABLE ORDERS_DETAIL (
  ORD_LINE_NO BIGINT NOT NULL,
  ORD_NO      BIGINT NOT NULL,
  PROD_ID     VARCHAR(10) NOT NULL,
  COMMENT     VARCHAR(100),
  ORD_AMT     BIGINT
);

ALTER TABLE ORDERS_DETAIL ADD CONSTRAINT PK_ORDERS_DETAIL PRIMARY KEY (ORD_LINE_NO);

CREATE TABLE PROD (
  PROD_ID VARCHAR(10) NOT NULL,
  PROD_NM VARCHAR(100) NOT NULL
);

ALTER TABLE PROD ADD CONSTRAINT PK_PROD PRIMARY KEY (PROD_ID);

INSERT INTO ORDERS_DETAIL
SELECT i AS ORD_LINE_NO
     , MOD(i,1000000) AS ORD_NO
     , 'PP'||MOD(i,5) AS PROD_ID
     , LPAD('X',10,'Y') AS COMMENT
     , CASE WHEN i < 1000 THEN i*100 ELSE i END AS ORD_AMT
  FROM generate_series(1,10000000) a(i);


INSERT INTO PROD    

 SELECT PROD_ID

           , MAX(ORD_NO)||'TEST_NAME'        

  FROM ORDERS_DETAIL

 GROUP BY PROD_ID;


--I have created a plain SQL function without specifying volatility. (VOLATILE is the default.)

CREATE OR REPLACE FUNCTION portal.f_get_prod_nm(text)

 RETURNS text

 LANGUAGE sql

AS $$

SELECT PROD_NM

  FROM PROD

 WHERE PROD_ID = $1;

$$;


--I have created a SQL function with STABLE volatility.

 CREATE OR REPLACE FUNCTION portal.f_get_prod_nm_pure(text)

 RETURNS text

 STABLE

 LANGUAGE sql

AS $$

SELECT PROD_NM

  FROM PROD

 WHERE PROD_ID = $1;

 $$;


--I have created a volatile PL/PgSQL function. (Volatile is default.)

CREATE OR REPLACE FUNCTION F_GET_PROD_NM_plpgsql(I_PROD_ID VARCHAR)

RETURNS VARCHAR

AS $$

DECLARE  O_PROD_NM VARCHAR(100);

BEGIN  

    SELECT PROD_NM   

        INTO O_PROD_NM   

      FROM PROD  

    WHERE PROD_ID = I_PROD_ID;  

     RETURN O_PROD_NM;

END; $$

LANGUAGE plpgsql;


--I have created a STABLE PL/PgSQL function.

CREATE OR REPLACE FUNCTION F_GET_PROD_NM_stable(I_PROD_ID VARCHAR)

RETURNS VARCHAR

STABLE

AS $$

DECLARE  O_PROD_NM VARCHAR(100);

BEGIN  

    SELECT PROD_NM   

        INTO O_PROD_NM   

      FROM PROD  

    WHERE PROD_ID = I_PROD_ID;  

     RETURN O_PROD_NM;

END; $$

LANGUAGE plpgsql;


--I have created an IMMUTABLE PL/PgSQL function

CREATE OR REPLACE FUNCTION 

F_GET_PROD_NM_immutable(I_PROD_ID VARCHAR)

RETURNS VARCHAR

IMMUTABLE

AS $$

DECLARE  O_PROD_NM VARCHAR(100);

BEGIN  

    SELECT PROD_NM   

        INTO O_PROD_NM   

      FROM PROD  

    WHERE PROD_ID = I_PROD_ID;  

     RETURN O_PROD_NM;

END; $$

LANGUAGE plpgsql;


Below are the SQL statements used to check the UDFs' performance, followed by their execution plans.


SELECT A.ORD_NO, F_GET_PROD_NM(A.PROD_ID) AS PROD_NM   

   FROM ORDERS_DETAIL A          

 WHERE A.ORD_NO BETWEEN 1 AND 100000;


 Seq Scan on orders_detail a (actual time=0.221..3589.892 rows=1000000 loops=1)

   Filter: ((ord_no >= 1) AND (ord_no <= 100000))

   Rows Removed by Filter: 9000000

   Buffers: shared hit=1000627 read=82726

 Planning:

   Buffers: shared hit=400

 Planning Time: 8.921 ms

 Execution Time: 3637.446 ms


SELECT A.ORD_NO, F_GET_PROD_NM_PURE(A.PROD_ID) AS PROD_NM   

   FROM ORDERS_DETAIL A          

 WHERE A.ORD_NO BETWEEN 1 AND 100000;


 Seq Scan on orders_detail a (actual time=0.251..3328.829 rows=1000000 loops=1)

   Filter: ((ord_no >= 1) AND (ord_no <= 100000))

   Rows Removed by Filter: 9000000

   Buffers: shared hit=1001107 read=82246

 Planning:

   Buffers: shared hit=400

 Planning Time: 2.277 ms

 Execution Time: 3370.142 ms


We can see that the SQL function with STABLE volatility is faster (3637 ms -> 3370 ms).


SELECT A.ORD_NO, F_GET_PROD_NM_PLPGSQL(A.PROD_ID) AS PROD_NM   

   FROM ORDERS_DETAIL A          

 WHERE A.ORD_NO BETWEEN 1 AND 100000;

 Seq Scan on orders_detail a (actual time=0.120..4559.058 rows=1000000 loops=1)

   Filter: ((ord_no >= 1) AND (ord_no <= 100000))

   Rows Removed by Filter: 9000000

   Buffers: shared hit=1000672 read=82662

 Planning Time: 0.095 ms

 Execution Time: 4602.952 ms


SELECT A.ORD_NO, F_GET_PROD_NM_STABLE(A.PROD_ID) AS PROD_NM   

   FROM ORDERS_DETAIL A          

 WHERE A.ORD_NO BETWEEN 1 AND 100000; 


 Seq Scan on orders_detail a (actual time=0.198..4309.964 rows=1000000 loops=1)

   Filter: ((ord_no >= 1) AND (ord_no <= 100000))

   Rows Removed by Filter: 9000000

   Buffers: shared hit=1000704 read=82630

 Planning Time: 0.048 ms

 Execution Time: 4351.524 ms


The elapsed time decreased from 4602 ms to 4351 ms when using the STABLE function.


SELECT A.ORD_NO, F_GET_PROD_NM_IMMUTABLE(A.PROD_ID) AS PROD_NM   

   FROM ORDERS_DETAIL A          

 WHERE A.ORD_NO BETWEEN 1 AND 100000; 


 Seq Scan on orders_detail a  (cost=0.00..476799.75 rows=973863 width=40) (actual time=0.212..4243.079 rows=1000000 loops=1)

   Filter: ((ord_no >= 1) AND (ord_no <= 100000))

   Rows Removed by Filter: 9000000

   Buffers: shared hit=1000736 read=82598

 Planning Time: 0.085 ms

 Execution Time: 4285.180 ms


I also wanted to view the execution statistics of the functions.


set track_functions = 'all';

--Then re-run the queries above.

analdb=# select * from pg_stat_user_functions;

 funcid | schemaname |        funcname         |  calls  | total_time | self_time
--------+------------+-------------------------+---------+------------+-----------
  42612 | portal     | f_get_prod_nm           | 1000000 |   2722.779 |  2722.779
  50937 | portal     | f_get_prod_nm_pure      | 1000000 |    2536.63 |   2536.63
  50934 | portal     | f_get_prod_nm_plpgsql   | 1000000 |   3901.993 |  3901.993
  50935 | portal     | f_get_prod_nm_stable    | 1000000 |   3809.062 |  3809.062
  50936 | portal     | f_get_prod_nm_immutable | 1000000 |   3804.095 |  3804.095
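From these counters you can derive an approximate per-call cost. The column names below are the real columns of pg_stat_user_functions; total_time is in milliseconds.

```sql
-- Rank the functions by average time per call.
SELECT funcname,
       calls,
       round((total_time / calls)::numeric, 6) AS ms_per_call
  FROM pg_stat_user_functions
 ORDER BY ms_per_call DESC;
```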


Conclusion

1. In most cases, SQL functions that contain only pure SQL are a better choice than PL/pgSQL functions. Do not reach for LANGUAGE plpgsql out of habit: when you do not need procedural elements or variables that are unavailable in SQL functions, you should not create PL/pgSQL functions.
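One reason the plain SQL versions are faster is that the planner can inline a simple single-statement SQL function into the calling query, which it never does for PL/pgSQL. A quick way to check whether inlining happened is to look at the verbose plan (a sketch; inlining has additional preconditions, such as the function body being a single SELECT):

```sql
-- If the SQL function was inlined, the plan shows a subplan or join
-- against PROD instead of an opaque function call in the target list.
EXPLAIN (VERBOSE)
SELECT A.ORD_NO, F_GET_PROD_NM_PURE(A.PROD_ID)
  FROM ORDERS_DETAIL A
 WHERE A.ORD_NO BETWEEN 1 AND 100000;
```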

2. When you create a function, check its volatility. If the function only reads the database and returns the same result for the same arguments within a single statement, declare it STABLE; if it depends only on its arguments and does not access the database at all, declare it IMMUTABLE. Do not leave a read-only lookup function at the default VOLATILE.
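You can inspect a function's declared volatility in the pg_proc catalog (provolatile is 'i', 's', or 'v') and tighten the category without recreating the function:

```sql
-- Check the declared volatility of the test functions.
SELECT proname, provolatile
  FROM pg_proc
 WHERE proname LIKE 'f_get_prod_nm%';

-- Change the category after the fact.
ALTER FUNCTION f_get_prod_nm(text) STABLE;
```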


FOOTNOTE 1

You can reset the statistics in pg_stat_user_functions by passing the function's oid:

select pg_stat_reset_single_function_counters(oid);
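If you prefer not to look up the oid by hand, a regprocedure cast resolves it from the function's signature:

```sql
-- regprocedure resolves 'schema.name(argtypes)' to the function's oid.
SELECT pg_stat_reset_single_function_counters(
         'portal.f_get_prod_nm(text)'::regprocedure);
```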


FOOTNOTE 2

Regarding the difference between LANGUAGE sql and LANGUAGE plpgsql, refer to the answer at the site below.

https://stackoverflow.com/questions/24755468/difference-between-language-sql-and-language-plpgsql-in-postgresql-functions


Addendum

The same three volatility categories apply to PostgreSQL's built-in functions.

* Immutable Function : Given the same arguments, it always returns the same result, so the planner can evaluate it at planning time and fold it into a constant. The resulting constant can then be compared against column statistics.

* Stable Function    : It returns the same result for the same arguments within a single statement; it may read the database but must not modify it.

* Volatile Function  : Its result can change even within a single statement, and it may modify the database. The planner makes no assumptions about it and re-evaluates it for every row.


* Stable and volatile functions are not evaluated at planning time.


The following is a test script to check the performance difference between STABLE and IMMUTABLE functions.


 create table t2 (c1 int, c2 date);

insert into t2

 select i, make_date(2021,12,31) -i

   from generate_series(1,900000) a(i);


MAKE_DATE is an immutable function.


select *

  from t2

 where c2 >= make_date(1200,1,1);


 Seq Scan on t2  (cost=0.00..15233.00 rows=301398 width=8) (actual time=0.033..162.367 rows=300229 loops=1)

   Filter: (c2 >= '1200-01-01'::date)

   Rows Removed by Filter: 599771

 Planning Time: 1.087 ms

 Execution Time: 176.750 ms

(5 rows)


TO_DATE is a stable function.

select *

  from t2

 where c2 >= to_date('12000101','yyyymmdd');


 Seq Scan on t2  (cost=0.00..17483.00 rows=301398 width=8) (actual time=0.027..668.266 rows=300229 loops=1)

   Filter: (c2 >= to_date('12000101'::text, 'yyyymmdd'::text))

   Rows Removed by Filter: 599771

 Planning Time: 0.565 ms

 Execution Time: 682.389 ms


The elapsed time increased from 176 ms to 682 ms. We can infer that MAKE_DATE was evaluated only once, at planning time, while TO_DATE was executed once per row, i.e. 900,000 times.

In order to make a stable function run only once, you can wrap it in a scalar subquery.


select *

  from t2

 where c2 >= (select to_date('12000101','yyyymmdd'));


Seq Scan on t2  (cost=0.01..15233.01 rows=300000 width=8) (actual time=0.063..134.467 rows=300229 loops=1)

   Filter: (c2 >= $0)

   Rows Removed by Filter: 599771

   InitPlan 1 (returns $0)

     ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.012..0.013 rows=1 loops=1)

 Planning Time: 0.128 ms

 Execution Time: 148.075 ms


The drawback is that access to an InitPlan or SubPlan is always parallel restricted. A parallel restricted operation is one which cannot be performed in a parallel worker, but which can be performed in the leader while parallel query is in use.
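Parallel safety is tracked per function as well; you can check whether a function would restrict parallelism via pg_proc.proparallel ('s' = safe, 'r' = restricted, 'u' = unsafe):

```sql
-- Inspect both the volatility and parallel-safety markings.
SELECT proname, provolatile, proparallel
  FROM pg_proc
 WHERE proname = 'to_date';
```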


In the same context, I recommend avoiding the to_timestamp function in predicates.

Here is sample code to demonstrate my point.


select *

  from t2

 where c2 >= to_timestamp('12000101','yyyymmdd');


 Seq Scan on t2  (cost=0.00..17483.00 rows=301398 width=8) (actual time=0.021..753.129 rows=300229 loops=1)

   Filter: (c2 >= to_timestamp('12000101'::text, 'yyyymmdd'::text))

   Rows Removed by Filter: 599771

 Planning Time: 0.142 ms

 Execution Time: 766.851 ms


Here is the SQL statement I optimized.


explain (analyze)

select *

  from t2

 where c2 >= '12000101'::timestamp;


 Seq Scan on t2  (cost=0.00..15233.00 rows=301398 width=8) (actual time=0.023..130.683 rows=300229 loops=1)

   Filter: (c2 >= '1200-01-01 00:00:00'::timestamp without time zone)

   Rows Removed by Filter: 599771

 Planning Time: 0.093 ms

 Execution Time: 143.832 ms


We can see that the elapsed time decreased from 766 ms to 143 ms.

I can’t guarantee the correctness of the description I’ve given here, but it’s probably fairly accurate.

"The stable built-in function to_timestamp('12000101','yyyymmdd') was executed 900,000 times, once per row. When you replace to_timestamp('12000101','yyyymmdd') with '12000101'::timestamp, you get a tremendous performance boost because the constant is evaluated only once."


   

postgresdba.com