설문조사
PostgreSQL/PPAS 관련 듣고 싶은 교육은


총 게시물 169건, 최근 0 건
   

Filter Subquery Approach vs. Anti-Join Approach

글쓴이 : 모델광 날짜 : 2024-07-28 (일) 00:10 조회 : 290
In a recent Oracle SQL tuning project, I optimized a query for better performance. Below is the refined version of the Oracle query I worked on. I was curious to see how this query would perform in PostgreSQL 15. To investigate, I will examine several queries using the EMP, SAL_HIST, and WORK_HIST tables.

The first query utilizes the 'not exists A or not exists B' structure:
(I have placed the code to generate the test data at the end of this article. If you want to carry out the experiment, please refer to the Test Code section of this article.)

SELECT EMPNO
FROM EMP A
WHERE DEPT_NO = 10
  AND (
    NOT EXISTS (
      SELECT 1
      FROM SAL_HIST B
      WHERE A.EMPNO = B.EMPNO
        AND B.SAL > 1000
    )
    OR
    NOT EXISTS (
      SELECT 1
      FROM WORK_HIST C
      WHERE A.EMPNO = C.EMPNO
        AND C.WORK_DUR > 24
    )
  );

Here is the execution plan generated by the EXPLAIN command:

 Seq Scan on emp a  (cost=0.00..308.48 rows=2 width=4)
  Filter: ((deptno = 10) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 4))))
  SubPlan 2
    ->  Seq Scan on sal_hist b  (cost=0.00..38.25 rows=753 width=4)
          Filter: (sal > 1000)
  SubPlan 4
    ->  Seq Scan on work_hist c  (cost=0.00..38.25 rows=753 width=4)
          Filter: (work_dur > 24)


First, SubPlan 2 scans the sal_hist table and filters rows where sal > 1000. And then it builds a hash table with the filtered results.
Next, SubPlan 4 scans the work_hist table, filters rows where work_dur > 24. and builds a hash table with the filtered results.
Finally, the Seq Scan on emp node scans the emp table, filtering rows where deptno = 10. And then it probes the hash tables generated by SubPlan 2 and SubPlan 4 for each row in the emp table.

The use of hased subplans is efficient because the subquery results are computed once and stored in memory(work_mem). However, I have concerns about hashed subplans since they don't spill over to disk. If the intermediate result set exceeds the allocated work_mem, the query will use more memory than intended, which could potentially lead to an Out Of Memory (OOM) error. Therefore, even if the performance of the query is acceptable, it is prudent to rewrite it to avoid these risks.

In the queyr we have a compound predicate:
not exists A or not exists B

and this is logically equivalent to
not (exists A and exists B)

I have rewritten the query to reflect this equivalence expecting that the optimizer might find a different, better way of executing it:

SELECT EMPNO
 FROM EMP A
WHERE DEPTNO = :P_DEPT_NO
  AND NOT (
    EXISTS (
      SELECT 1
      FROM SAL_HIST B
      WHERE A.EMPNO = B.EMPNO
        AND B.SAL > 1000
    )
    AND
    EXISTS (
      SELECT 1
      FROM WORK_HIST C
      WHERE A.EMPNO = C.EMPNO
        AND C.WORK_DUR > 24
    )
  );


Here is the execution plan for the revised query:

 Seq Scan on emp a  (cost=0.00..308.48 rows=2 width=4)
  Filter: ((deptno = 10) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 4))))
  SubPlan 2
    ->  Seq Scan on sal_hist b  (cost=0.00..38.25 rows=753 width=4)
          Filter: (sal > 1000)
  SubPlan 4
    ->  Seq Scan on work_hist c  (cost=0.00..38.25 rows=753 width=4)
          Filter: (work_dur > 24)


It might be possible that with different data volumes you'd get different execution plans, but in PostgreSQL 15 the optimizer transformed my query into the original form - in other words it recognized the equivalece of "not (A and B)" and transformed it back to "not A or not B". Technically the optimizer posesses the capability to employ an anti-join when confronted with a NOT EXISTS clause. However, there are cases where this does not appear to be allowed even when we - the users - can see that it might be a very effective strategy.

So, can we force the optimizer to pull off an anti-join with a NOT EXISTS clause in this case?
Since I've asked the question you have probably guessed that the answer is affirmative. Let's examine the following alternative SQL statement, which achieves the same result:

SELECT EMPNO
  FROM EMP A
 WHERE DEPTNO = 10
   AND NOT EXISTS (SELECT 1
                     FROM SAL_HIST B, WORK_HIST C
                    WHERE B.EMPNO = C.EMPNO
                      AND B.SAL > 1000
                      AND C.WORK_DUR > 24
                      AND A.EMPNO = B.EMPNO);


Note that I have managed to transform the two filter subqueris into a single join subquery. Here is the plan I obtained on PostgreSQL 15.1:

Merge Anti Join  (cost=148.60..214.66 rows=2 width=4)
  Merge Cond: (a.empno = b.empno)
  ->  Index Scan using pk_emp on emp a  (cost=0.14..12.38 rows=3 width=4)
        Filter: (deptno = 10)
  ->  Merge Join  (cost=148.46..194.75 rows=2835 width=4)
        Merge Cond: (b.empno = c.empno)
        ->  Sort  (cost=74.23..76.11 rows=753 width=4)
              Sort Key: b.empno
              ->  Seq Scan on sal_hist b  (cost=0.00..38.25 rows=753 width=4)
                    Filter: (sal > 1000)
        ->  Sort  (cost=74.23..76.11 rows=753 width=4)
              Sort Key: c.empno
              ->  Seq Scan on work_hist c  (cost=0.00..38.25 rows=753 width=4)
                    Filter: (work_dur > 24)


Surprisingly, the optimizer transformed a single subquery into an anti-jon, showcasing its flexibility.​

If you think about the query's underlying purpose (prompted, perhpas, by the logical equivalence descibed earlier) you might rephrase the question as "find the EMPNOs that fail to appear in all two related tables" or "work out the EMPNOs that appear in one of the two related tables - those are the EMPNOs I want to exclude." The latter interpretation is exactly what the third query achieves: it joins the two tables and then applies the "not exists" or anti-join against the result.

We can actually take a different approach (which might, or might not, be efficient - depending on the specific data and indexes). We can translate the question into the following:

SELECT EMPNO
  FROM EMP A
 WHERE DEPTNO = :P_DEPT_NO
 EXCEPT (
           SELECT EMPNO
             FROM SAL_HIST B
          WHERE B.SAL > 1000
          INTERSECT
         SELECT EMPNO
            FROM WORK_HIST C
         WHERE C.WORK_DUR > 24
            )
       ;


The subquery using the INTERSECT operator gets me the list of EMPNOs that appear in both tables. The EXCEPT operator then takes the list of EMPNOs with the correct DEPTNO values and excludes those that are found in the subquery - giving me the result set I want.

Here is the execution plan I obtained:

HashSetOp Except  (cost=0.00..107.55 rows=3 width=8)
  ->  Append  (cost=0.00..107.05 rows=201 width=8)
        ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..1.21 rows=3 width=8)
              ->  Seq Scan on emp a  (cost=0.00..1.18 rows=3 width=4)
                    Filter: (deptno = 10)
        ->  Result  (cost=0.00..104.84 rows=198 width=8)
              ->  HashSetOp Intersect  (cost=0.00..102.86 rows=198 width=8)
                    ->  Append  (cost=0.00..99.09 rows=1506 width=8)
                          ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..45.78 rows=753 width=8)
                                ->  Seq Scan on sal_hist b  (cost=0.00..38.25 rows=753 width=4)
                                      Filter: (sal > 1000)
                          ->  Subquery Scan on "*SELECT* 3"  (cost=0.00..45.78 rows=753 width=8)
                                ->  Seq Scan on work_hist c  (cost=0.00..38.25 rows=753 width=4)
                                      Filter: (work_dur > 24)


Conclusion
I am not claiming that an anti-join approach will be more efficient than a filter subquery, or vice versa; performance ultimately depends on the data volume and patterns, taking into account the available indexes.

My goal here wasn't to show state of the art code, but rather show other possible ways to quickly rewrite queries when you are dealing with performance issues. If you are unfortunately confronted with a slow-performing query, I hope this note serves as a helpful starting point for devising your own dedicated solution.

Footnote
In Oracle, the filter operation can be more efficient due to the subquery caching mechanism. Given that PostgreSQL doesn't possess this feature, opting for an anti-join strategy is likely a prudent choice.

Test Code
Here are example DDL statements for setting up test tables, along with corresponding INSERT statements to fill them with dample data:

​-- DDL for EMP table
CREATE TABLE EMP (
    EMPNO INT PRIMARY KEY,
    DEPT_NO INT,
    -- Add other columns as needed
);

-- DDL for SAL_HIST table
CREATE TABLE SAL_HIST (
    EMPNO INT,
    SAL INT,
    -- Add other columns as needed
    FOREIGN KEY (EMPNO) REFERENCES EMP(EMPNO)
);


-- DDL for WORK_HIST table
CREATE TABLE WORK_HIST (
    EMPNO INT,
    WORK_DUR INT,
    -- Add other columns as needed
    FOREIGN KEY (EMPNO) REFERENCES EMP(EMPNO)
);

​-- INSERT statements for sample data in EMP table
INSERT INTO EMP (EMPNO, DEPT_NO) VALUES
(1, 101),
(2, 102),
(3, 101),
-- Add more rows as needed;


-- INSERT statements for sample data in SAL_HIST table
INSERT INTO SAL_HIST (EMPNO, SAL) VALUES
(1, 1200),
(2, 900),
(3, 1100),
-- Add more rows as needed;

​-- INSERT statements for sample data in WORK_HIST table
INSERT INTO WORK_HIST (EMPNO, WORK_DUR) VALUES
(1, 30),
(2, 20),
(3, 25),
-- Add more rows as needed;

Added on Aug 18, 2024
A postgreSQL DBA emailed me, suggesting that we could further optimize the query. Below is the rewritten version of the example query, followed by its execution plan:

SELECT EMPNO
FROM EMP A
WHERE DEPT_NO = 10
AND NOT EXISTS (SELECT 1
FROM SAL_HIST B
WHERE A.EMPNO = B.EMPNO
AND B.SAL > 1000
AND EXISTS (SELECT 1
FROM WORK_HIST C
WHERE B.EMPNO = C.EMPNO
AND C.WORK_DUR > 24
)
);

Nested Loop Anti Join (cost=1.06..3.17 rows=1 width=4)
Join Filter: (a.empno = b.empno)
-> Seq Scan on emp a (cost=0.00..1.04 rows=1 width=4)
Filter: (dept_no = 10)
-> Hash Semi Join (cost=1.06..2.12 rows=1 width=4)
Hash Cond: (b.empno = c.empno)
-> Seq Scan on sal_hist b (cost=0.00..1.04 rows=2 width=4)
Filter: (sal > 1000)
-> Hash (cost=1.04..1.04 rows=2 width=4)
-> Seq Scan on work_hist c (cost=0.00..1.04 rows=2 width=4)
Filter: (work_dur > 24)

The execution plan reveals that the postgreSQL optimizer can collapse a subquery
that is nested within another subquery, a capability that is not available in Oracle.
In Oracle, nested subqueries can not be unnested, meaning that subqueries within
subquries must be filtered separately.

   

postgresdba.com