설문조사
PostgreSQL/PPAS 관련 듣고 싶은 교육은


총 게시물 162건, 최근 0 건
   

Correlated Subquery Collapse

글쓴이 : 모델광 날짜 : 2022-05-28 (토) 21:47 조회 : 1265

Last year, I worked on a project where I had to verify that ported SQL statements from Oracle are compatible with PostgreSQL. There were a lot of "mysterious" performance problems in the ported queries. And Most of those problems turned out to be a badly chosen query execution plan. After working on this project, I realized that if we do not understand how PostgreSQL planner works we can be in a sticky situation after migrating data from Oracle.

In this note, I will provide a query which produces a totally different execution plan in PostgreSQL and supply a solution to make the query as fast as in Oracle.

I have run up some test scripts to examine how PostgreSQL handles a correlated subquery.


CREATE TABLE employee (
    empno numeric(5,0) NOT NULL,
    ename character varying(10),
    job character varying(9),
    mgr numeric(5,0),
    hiredate timestamp(0),
    sal numeric(7,2) NOT NULL,
    comm numeric(7,2) NOT NULL,
    deptno numeric(2,0),
    sido_nm character varying(100)
);
insert into employee
select i, chr(65+mod(i,26))||i::text||'NM'
      ,case when mod(i,10000)=0 then 'PRESIDENT'
            when mod(i,1000) = 0 then 'MANAGER'
            when mod(i,3)=0 then 'SALESMAN'
            when mod(i,3)=1 then 'ANALYST'
            when mod(i,3)=2 then 'CLERK'
        end as job
      ,case when mod(i,10000)= 0 then null
            when mod(i,1000)= 1 then 10000
            when i >= 9000 then 1000
            else ceiling((i+1000)/1000)*1000
        end as mgr
      , current_date - i
      , trunc(random() * 10000) as sal
      , trunc(random() * 10000) as com
      , mod(i,12)+1            as deptno
      , case when mod(i,3) = 0 then 'Jeonbuk'
            when mod(i,3) = 1 then 'Kangwon'
            else                  'Chungnam'
        end as sido_nm
from generate_series(1,10000) a(i);
                           
ALTER TABLE employee ADD CONSTRAINT employee_pk
  PRIMARY KEY (empno);


The following query reports all employees who earned more than the average for their department, where earnings is calculated as the sum of salary and commission.


SELECT EMPNO, SAL, COMM, DEPTNO
  FROM EMPLOYEE A
 WHERE (A.SAL+A.COMM) >
                                 (SELECT AVG(B.SAL+COMM)
                                    FROM EMPLOYEE B
                                   WHERE A.DEPTNO = B.DEPTNO
                                  )
ORDER BY A.DEPTNO, A.EMPNO;


In Oracle I could get the result in 3 or 4 seconds. But in PostgreSQL with the same amount of data I could get the result in about 13 seconds. In order to solve this problem you should check the execution plan. The following is the execution plan in PostgreSQL.


Sort (actual time=13661.291..13661.432 rows=4979 loops=1)
  Sort Key: a.deptno, a.empno
  Sort Method: quicksort  Memory: 893kB
  Buffers: shared hit=1100113
  ->  Seq Scan on employee a (actual time=2.617..13649.060 rows=4979 loops=1)
        Filter: ((sal + comm) > (SubPlan 1))
        Rows Removed by Filter: 5021
        Buffers: shared hit=1100110
        SubPlan 1
          ->  Aggregate (actual time=1.363..1.363 rows=1 loops=10000)
                Buffers: shared hit=1100000
                ->  Seq Scan on employee b (actual time=0.002..1.220 rows=833 loops=10000)
                      Filter: (a.deptno = deptno)
                      Rows Removed by Filter: 9167
                      Buffers: shared hit=1100000
Planning Time: 0.079 ms
Execution Time: 13661.574 ms


We can observe that PostgreSQL did not collapse the subquery. So it had to perform a sequential scan on the table EMPLOYEE 10,000 times which led to poor performance. It took just 1.363 ms to run the correlated subquery, but repetitive execution of the subquery took 13630 ms (13630=1.363 * 10,000). I do not know any technique to force PostgreSQL to collapse the subquery. 


If we run the same query in Oracle, the logical optimizer of Oracle transforms the SQL statement like this:


--I got this query after running a 10053 trace event and examining the contents of the trace file.


SELECT EMPNO, SAL, COMM, DEPTNO
  FROM (
        SELECT A.EMPNO, A.SAL, A.COMM, A.DEPTNO, AVG(B.COMM +B.SAL) AS AVG_VAL
          FROM EMPLOYEE A, EMPLOYEE B
          WHERE A.DEPTNO = B.DEPTNO
        GROUP BY A.DEPTNO, A.EMPNO, A.SAL, A.COMM
       ) X
 WHERE SAL + COMM > AVG_VAL
ORDER BY DEPTNO, EMPNO;


Firstly the logical optimizer collapses the corellated subquery, so the subquery becomes an inline view. Secondly the resulting inline view merges with the main query. What a beautiful optimizer! It is like an artificial intelligence is working in Oracle.


If we run the transformed query in PostgreSQL, we get the following execution plan.


Sort (actual time=3315.641..3316.060 rows=4979 loops=1)
  Sort Key: x.deptno, x.empno
  Sort Method: quicksort  Memory: 581kB
  Buffers: shared hit=220
  ->  Subquery Scan on x (actual time=3304.109..3310.219 rows=4979 loops=1)
        Buffers: shared hit=220
        ->  HashAggregate (actual time=3304.108..3309.858 rows=4979 loops=1)
              Group Key: a.empno
              Filter: ((a.sal + a.comm) > avg((b.comm + b.sal)))
              Batches: 1  Memory Usage: 6545kB
              Rows Removed by Filter: 5021
              Buffers: shared hit=220
              ->  Hash Join (actual time=2.237..1148.766 rows=8333336 loops=1)
                    Hash Cond: (a.deptno = b.deptno)
                    Buffers: shared hit=220
                    ->  Seq Scan on employee a (actual time=0.008..2.944 rows=10000 loops=1)
                          Buffers: shared hit=110
                    ->  Hash (actual time=2.199..2.200 rows=10000 loops=1)
                          Buckets: 16384  Batches: 1  Memory Usage: 587kB
                          Buffers: shared hit=110
                          ->  Seq Scan on employee b (actual time=0.003..1.137 rows=10000 loops=1)
                                Buffers: shared hit=110
Planning:
  Buffers: shared hit=2
Planning Time: 0.187 ms
Execution Time: 3316.562 ms


We can notice that the number of block IOs dropped from 1100113 to 220. So under PostgreSQL 14 we have to be able to transform the query on our own. If your query is relatively slow in PostgreSQL after migration, you should check the execution plan and pinpoint the problematic operation in the plan. In some cases you should tweak the query to remove the problematic operation in the executioin plan.

For your reference PostgreSQL does not provide SORT GROUP BY operation. It has to perform HashAggregate and then perform Sort.


Let the query visit a table only once

If we take a closer look at the transformed SQL statement, we can notice that the query is touching the table EMPLOYEE twice to get the resulting dataset. You had better touch a table once to imrove performance. So as a quick "what if" test I tried using an analytic functioin as follows:


SELECT *
  FROM (
       SELECT EMPNO, SAL, COMM, DEPTNO, AVG(SAL+COMM) OVER (PARTITION BY DEPTNO) AS AVG_VAL
         FROM EMPLOYEE
       ) A
 WHERE (SAL+ COMM) > AVG_VAL
ORDER BY A.DEPTNO, A.EMPNO;


Sort (actual time=25.412..25.581 rows=4979 loops=1)
  Sort Key: a.deptno, a.empno
  Sort Method: quicksort  Memory: 581kB
  Buffers: shared hit=110
  ->  Subquery Scan on a (actual time=10.749..20.021 rows=4979 loops=1)
        Filter: ((a.sal + a.comm) > a.avg_val)
        Rows Removed by Filter: 5021
        Buffers: shared hit=110
        ->  WindowAgg (actual time=10.747..17.564 rows=10000 loops=1)
              Buffers: shared hit=110
              ->  Sort (actual time=10.481..11.202 rows=10000 loops=1)
                    Sort Key: employee.deptno
                    Sort Method: quicksort  Memory: 1166kB
                    Buffers: shared hit=110
                    ->  Seq Scan on employee (actual time=0.022..4.341 rows=10000 loops=1)
                          Buffers: shared hit=110
Planning Time: 0.387 ms
Execution Time: 25.792 ms


Note that the number of block IOs has fallen from 220 to 110 and we have gotten up to 100 times performance increase (3316 -> 25 ms) in terms of execution time. I woud not argue that analytic functions do always help performance, but in this particular case it worked.


Conclusion

It is easy to overlook the fact that the execution plan of a query can matter for query performance. When you move your database from Oracle to PostgreSQL, you should check queries involving correlated subquries. One of the more prominent limitations of the PostgreSQL optimizer is the fact that PostgreSQL is not so good at collapsing a correlated subquery, which may have a significant impact on performance and it may not be possible to bypass the problem unless you amend the query to collapse the subquery.


Added on JULY 18, 2023

When I wrote this note, I believed that the last query was the optimal choice. However, I regret to say that the query I presented in this note was not the most efficient. Recently I have discovered  a different strategy that achieves better performance.

You are kindly advised to read this updated note.


   

postgresdba.com