설문조사
PostgreSQL/PPAS 관련 듣고 싶은 교육은


총 게시물 167건, 최근 0 건
   

UDF ISSUE 2

글쓴이 : 모델광 날짜 : 2023-08-20 (일) 12:02 조회 : 633
When I am in a position to establish a SQL guide, I prohibit the creation of user-defined functions in a database because I have encountered many performance issues.
It is very difficult to debug and maintain a UDF. Troubleshooting issues within UDFs requires specialized knowledge and debugging techniques. If you are not familiar with the business requirements, you will struggle to debug a UDF.
However, when we join a project, there are many database systems that have already UDFs in them. Recently, I came across a performance issue related to a UDF function.
In this note, I will explain how I improved a query that utilized a UDF.

I will start with a simple script to generate a sample data set:


CREATE TABLE emp (
    empno integer NOT NULL,
    ename character varying(10),
    job character varying(9),
    mgr integer,
    hiredate timestamp without time zone,
    sal numeric(7,2),
    comm numeric(7,2),
    deptno integer
);

INSERT INTO EMP VALUES
(7369,'SMITH','CLERK',7902,to_date('17-12-1980','dd-mm-yyyy'),800,NULL,20),
(7499,'ALLEN','SALESMAN',7698,to_date('20-2-1981','dd-mm-yyyy'),1600,300,30),
(7521,'WARD','SALESMAN',7698,to_date('22-2-1981','dd-mm-yyyy'),1250,500,30),
(7566,'JONES','MANAGER',7839,to_date('2-4-1981','dd-mm-yyyy'),2975,NULL,20),
(7654,'MARTIN','SALESMAN',7698,to_date('28-9-1981','dd-mm-yyyy'),1250,1400,30),
(7698,'BLAKE','MANAGER',7839,to_date('1-5-1981','dd-mm-yyyy'),2850,NULL,30),
(7782,'CLARK','MANAGER',7839,to_date('9-6-1981','dd-mm-yyyy'),2450,NULL,10),
(7788,'SCOTT','ANALYST',7566,to_date('13-7-1987','dd-mm-yyyy'),3000,NULL,20),
(7839,'KING','PRESIDENT',NULL,to_date('17-11-1981','dd-mm-yyyy'),5000,NULL,10),
(7844,'TURNER','SALESMAN',7698,to_date('8-9-1981','dd-mm-yyyy'),1500,0,30),
(7876,'ADAMS','CLERK',7788,to_date('13-7-1987', 'dd-mm-yyyy'),1100,NULL,20),
(7900,'JAMES','CLERK',7698,to_date('3-12-1981','dd-mm-yyyy'),950,NULL,30),
(7902,'FORD','ANALYST',7566,to_date('3-12-1981','dd-mm-yyyy'),3000,NULL,20),
(7934,'MILLER','CLERK',7782,to_date('23-1-1982','dd-mm-yyyy'),1300,NULL,10);

I have also created an idiotic UDF in order to demonstrate my point.

CREATE OR REPLACE FUNCTION f_get(p_empno in INT)
 RETURNS varchar
 LANGUAGE plpgsql
AS $function$
DECLARE
  n1 INT;
BEGIN
  SELECT COUNT(*) INTO n1 FROM EMP WHERE EMPNO = p_empno;
  IF n1 = 0 THEN
          RETURN 'N';
  ELSE
          RETURN 'Y';
  END IF;
  RETURN NULL;
END;
$function$
;
Note that the EMP table is a tiny table, occupying just 1 block.
Here is the query in question:
 
--Before running the query, I have set the track_functions parameter to 'all' to enable tracking of function call counts and time used.

set track_functions to 'all';

SELECT COUNT(COMM)
  FROM EMP
WHERE F_GET(EMPNO) = 'Y';


And here is the execution plan:

Aggregate (actual time=0.134..0.135 rows=1 loops=1)
  Output: count(comm)
  Buffers: shared hit=15
  ->  Seq Scan on scott.emp (actual time=0.038..0.130 rows=14 loops=1)
        Output: empno, ename, job, mgr, hiredate, sal, comm, deptno
        Filter: ((f_get(emp.empno))::text = 'Y'::text)
        Buffers: shared hit=15
Query Identifier: 1659856310246027018
Planning Time: 0.040 ms
Execution Time: 0.150 ms

select * from pg_stat_user_functions;

funcid|schemaname|funcname|calls|total_time|self_time|
------+----------+--------+-----+----------+---------+
 26140|scott         |f_get   |   14|      0.13|     0.13|


The following is a brief interpretation of the plan:

1) The Aggregate node calles the Seq Scan node
2) The Seq Scan node performs a sequential scan on the EMP table.
     The "rows=14" indicates that this scan retrieves 14 rows from the EMP table.
     We can deduce that the UDF f_get is executed 14 times because 14 EMPNOs are supposed to be passed as input parameters.
3) The Seq Scan node passes the 14 rows up to the Aggregate node.
     The Aggregate node does the aggregation operation on the COMM column, outputting the value 4.

Can you spot any inefficiencies in the plan?

The pain point in the plan is that the f_get funtion is being executed 14 times, resulting in 14 block I/Os. Addtionally, the sequential scan on the EMP table incurs one buffer block I/O. Therefore, the total number of block I/Os is 15.
So how will we be able to decrease the block I/O?
If you know the answer, I dare to say, you are an intermediate level developer.

The answer is to reduce the number of the UDF calls by adding a predicate in the WHERE clause. Here I have introduced an extra predicate COMM IS NOT NULL, since we are counting non-null entries of the COMM column.
Before executing the modified query, I have reset the statistics for the f_get function to zero.

select pg_stat_reset_single_function_counters('f_get'::regproc);

The modified query is as follows:

SELECT COUNT(COMM)
  FROM EMP
WHERE F_GET(EMPNO) = 'Y'
  AND COMM IS NOT NULL;

The new execution plan is:

Aggregate (actual time=0.066..0.066 rows=1 loops=1)
  Output: count(comm)
  Buffers: shared hit=5
  ->  Seq Scan on scott.emp (actual time=0.038..0.064 rows=4 loops=1)
        Output: empno, ename, job, mgr, hiredate, sal, comm, deptno
        Filter: ((emp.comm IS NOT NULL) AND ((f_get(emp.empno))::text = 'Y'::text))
        Rows Removed by Filter: 10
        Buffers: shared hit=5
Query Identifier: 7377752709162974183
Planning Time: 0.044 ms
Execution Time: 0.081 ms

select * from pg_stat_user_functions;

|funcid|schemaname|funcname|calls|total_time|self_time|
|------|----------|--------|----         -|----------|---------|
|26,140|scott        |f_get        |4    |0.217     |0.217    |


As ever you have to look at the Filter operation. PostgreSQL called the function when the COMM column is not null. The f_get function was executed only 4 times.

As you can see, the shared hit has decreased from 15 to 5 (mainly because the number of function calls has dropped from 14 to 4), and the run time has dropped from 0.150 ms to 0.081 ms.

Conclusion

In essence, while a UDF is an immensely powerful tool in a database, it requires careful usage to ensure it remains a help, not a hindrance.  It is important to reduce the number of function calls. One way to achieve this is by adding a not-null predicate for nullable columns. The more information you give to PostgreSQL, the more efficient your queries are going to be.

Footnote
If you perform the same test on Oracle 19c, Oracle automatically adds the not-null predicate when the COMM column is nullable. You need to take a hands on approach in PostgreSQL.

   

postgresdba.com