Last year I worked on a project where I had to port Oracle functions to PostgreSQL functions. The following is one of the code patterns I refactored in order to improve its performance.
I have never worked as a programmer, but it seems that most programming languages do not even have the concept of relational operators and set-based techniques. So engineers become good at developing row-by-row algorithms (Oracle's Tom Kyte called this "slow-by-slow" processing). However, relational databases work against row-by-row processing.
Manipulating data as a set is essential when dealing with data in a relational database, because row-by-row (slow-by-slow) processing does not exploit system resources efficiently. To convert serial programs into multi-process parallel programs, we need to use good set-based mechanics.
Set-based techniques are not only faster and more efficient, using fewer system resources; they also make your whole business process more robust and less frightening, and take the risk out of manipulating and managing large data sets. It is a case of learning the right tools for the right job.
Below are the original Oracle procedure and the ported PostgreSQL function.
The SQL statement itself in the UDF makes no sense from an application point of view, but it should illustrate the performance difference between row-by-row processing and set-based processing.
* Oracle Procedure in an AS-IS database
CREATE OR REPLACE PROCEDURE P_SUM_AMT(
 p_empno     IN  NUMBER
,v_amt_total OUT NUMBER)
IS
 v_amount NUMBER;
 v_ord_no NUMBER;
 CURSOR C1 IS
   SELECT ORD_NO
     FROM OFFLINE_ORDER
    WHERE EMPNO = p_empno;
BEGIN
 v_amt_total := 0;
 v_amount    := 0;
 OPEN C1;
 LOOP
   FETCH C1 INTO v_ord_no;           --take a row
   EXIT WHEN C1%NOTFOUND;
   SELECT NVL(UNIT_PRICE*QUANTY,0)   --initiate another query
     INTO v_amount
     FROM ORD_ITEM
    WHERE ORD_NO = v_ord_no;
   v_amt_total := v_amt_total + v_amount;  --accumulate the running sum
 END LOOP;
 CLOSE C1;
END;
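For reference, a procedure like this would typically be invoked from an anonymous block such as the following (the employee number 7369 is just a hypothetical value for illustration):

DECLARE
 v_total NUMBER;
BEGIN
 P_SUM_AMT(7369, v_total);                 -- run the row-by-row procedure
 DBMS_OUTPUT.PUT_LINE('Total: ' || v_total);
END;
/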
The inefficiency in the procedure above is typical of PL/SQL loops: they take a row and then basically initiate another query using the contents of that row to get a result. Sometimes they have to store the result in a variable and do another operation on it, for example summing it up. Some DBAs call this slow-by-slow processing, which incurs a lot of block I/Os and latch (locking) related wait events.
On the other hand, with set-based processing we can get the result in a single query. The following is a function which uses a set-based algorithm.
* PostgreSQL Function (written on version 13) in a TO-BE database
In PostgreSQL 13, a procedure does not support OUT parameters, so I created a function instead of a procedure. Here I created a SQL function, not a PL/pgSQL function, because a SQL function is more efficient than a PL/pgSQL function. I will show the performance difference in another post.
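If the procedure interface itself had to be preserved, PostgreSQL 13 does accept INOUT parameters in procedures, so a workaround along these lines is possible (a sketch only, not the approach I took; the plain join assumes the same hypothetical tables):

CREATE OR REPLACE PROCEDURE P_SUM_AMT_PROC(
 p_empno           numeric
,INOUT v_amt_total numeric)
LANGUAGE plpgsql
AS
$BODY$
BEGIN
 SELECT COALESCE(SUM(B.UNIT_PRICE * B.QUANTITY), 0)
   INTO v_amt_total
   FROM OFFLINE_ORDER A
   JOIN ORD_ITEM B ON B.ORD_NO = A.ORD_NO
  WHERE A.EMPNO = p_empno;
END;
$BODY$;

-- CALL P_SUM_AMT_PROC(7369, NULL);  -- 7369 is a hypothetical employee number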
CREATE OR REPLACE FUNCTION P_SUM_AMT
(p_empno numeric)
RETURNS NUMERIC
LANGUAGE SQL PARALLEL SAFE
AS
$BODY$
SELECT COALESCE(SUM(AMOUNT),0)
FROM OFFLINE_ORDER A
, LATERAL (SELECT UNIT_PRICE*QUANTITY AS AMOUNT
FROM ORD_ITEM
WHERE ORD_NO = A.ORD_NO) B
WHERE 1=1
AND A.EMPNO = p_empno;
$BODY$;
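The function is then called in a single SELECT. For comparison, the LATERAL subquery in the body could equally be written as a plain inner join, which should return the same result here (7369 is a hypothetical employee number):

SELECT P_SUM_AMT(7369) AS amt_total;

-- equivalent plain-join form of the function body
SELECT COALESCE(SUM(B.UNIT_PRICE * B.QUANTITY), 0) AS amt_total
  FROM OFFLINE_ORDER A
  JOIN ORD_ITEM B ON B.ORD_NO = A.ORD_NO
 WHERE A.EMPNO = 7369;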
It may be argued that the ported code is harder to write. However, there are huge performance gains. Once programmers have learned to exploit set-based techniques and have seen those gains, they can actually start to innovate business processes to run in near real time, or much faster, and get rid of those ludicrous functions full of row-by-row FOR ~ LOOP logic that consume a lot of system resources.
Conclusion
Try to get the required result in one go. Try to exploit the set-based techniques which are suitable for relational databases. When it comes to operating on data in bulk, row-by-row processing can be a disaster because it consumes a lot of system resources.