설문조사
PostgreSQL/PPAS 관련 듣고 싶은 교육은


총 게시물 175건, 최근 0 건
   

Using MATERIALIZED and NOT MATERIALIZED

글쓴이 : 모델광 날짜 : 2025-02-09 (일) 12:22 조회 : 339
If you've ever used the WITH clause in an Oracle database, you'are probably aware of the hints /+ materialized */, which forces Oracle to create a local temporary table to hold the results of the WITH clause for subsequent use, and /+ inline */, which instructs the optimizer to copy the text of the WITH clause into the body of the query during the optimization phase.

In this node, I'll explain how we can implement those two hints in PosgreSQL when migrating queries from Oracle to PostgreSQL.

Here are test codes to create sample date:

--The following test was conducted on PostgreSQL 16.3
DROP TABLE IF EXISTS online_order;
CREATE TABLE online_order (
    ord_no BIGINT NOT NULL,
    cust_no INT NOT NULL,
    ord_date DATE NOT NULL,
    ord_dt VARCHAR(8) NOT NULL,
    ord_status_cd VARCHAR(1) NOT NULL,
    comment VARCHAR(100)
);

INSERT INTO online_order
SELECT i, MOD(i, 1000000) AS cust_no,
    '20240515'::date - MOD(i, 1000) AS ord_date,
    TO_CHAR(current_date - MOD(i, 1000), 'yyyymmdd') AS ord_dt,
    (MOD(i, 4) + 1) AS ord_status_cd,
    LPAD('x', 100, 'x')
FROM generate_series(1, 2000000, 2) AS a(i);

CREATE UNIQUE INDEX ONLINE_ORDER_U1 ON ONLINE_ORDER(ord_no);
ANALYZE ONLINE_ORDER;

By default, if the main query references the WITH subquery just once, PostgreSQL inlines the subquery directly into the main query. If the subquery is referenced multiple times, PostgreSQL creates a temporary result set.
We can override this default behavior by specifying MATERIALIZED to force separate calculation of the WITH subquery, or by spcifying NOT MATERIALIZED to force the subquery to be merged into the main query, avoiding materialization overhead.

Example Queries: MATERIALIZED vs. NOT MATERIALIZED
Let's compare how PostgreSQL processes these options.

Using MATERIALIZED
Here is a query using the MATERIALIZED keyword followed by its plan:

WITH W AS MATERIALIZED (
  SELECT ORD_NO, ORD_DATE, ORD_STATUS_CD
    FROM ONLINE_ORDER
   WHERE ORD_NO IS NOT NULL
 )
SELECT A.ORD_NO
     , A.ORD_DATE
     , A.ORD_STATUS_CD
  FROM ONLINE_ORDER A
 WHERE EXISTS (SELECT 1
                 FROM W B
                WHERE A.ORD_NO = B.ORD_NO
                  AND ORD_STATUS_CD = '4'
               )
   AND EXISTS (SELECT 1
                 FROM W C
                WHERE A.ORD_NO = C.ORD_NO
                  AND C.ORD_DATE = '20240331'::DATE
              )
   AND A.CUST_NO BETWEEN 100 AND 200;
 
Nested Loop Semi Join  (cost=51743.93..75965.51 rows=1 width=14) (actual time=5240.218..5240.220 rows=0 loops=1)
  Join Filter: (a.ord_no = c.ord_no)
  Rows Removed by Join Filter: 100000
  Buffers: shared hit=1969252 read=49979, temp read=148601 written=6145
  CTE w
    ->  Seq Scan on online_order  (cost=0.00..29231.00 rows=1000000 width=14) (actual time=0.011..128.084 rows=1000000 loops=1)
          Filter: (ord_no IS NOT NULL)
          Buffers: shared hit=16101 read=3130
  ->  Nested Loop  (cost=22512.92..24172.01 rows=1 width=22) (actual time=553.339..2360.939 rows=50 loops=1)
        Buffers: shared hit=1969252 read=49979, temp read=2101 written=6144
        ->  HashAggregate  (cost=22512.50..22514.50 rows=200 width=8) (actual time=529.410..714.272 rows=500000 loops=1)
              Group Key: b.ord_no
              Batches: 21  Memory Usage: 10305kB  Disk Usage: 11736kB
              Buffers: shared hit=16101 read=3130, temp read=2101 written=6144
              ->  CTE Scan on w b  (cost=0.00..22500.00 rows=5000 width=8) (actual time=0.016..397.783 rows=500000 loops=1)
                    Filter: ((ord_status_cd)::text = '4'::text)
                    Rows Removed by Filter: 500000
                    Buffers: shared hit=16101 read=3130, temp written=2929
        ->  Index Scan using online_order_u1 on online_order a  (cost=0.42..8.29 rows=1 width=14) (actual time=0.003..0.003 rows=0 loops=500000)
              Index Cond: (ord_no = b.ord_no)
              Filter: ((cust_no >= 100) AND (cust_no <= 200))
              Rows Removed by Filter: 1
              Buffers: shared hit=1953151 read=46849
  ->  CTE Scan on w c  (cost=0.00..22500.00 rows=5000 width=8) (actual time=0.027..57.411 rows=2000 loops=50)
        Filter: (ord_date = '2024-03-31'::date)
        Rows Removed by Filter: 998000
        Buffers: temp read=146500 written=1   
Planning Time: 0.162 ms
Execution Time: 5243.249 ms

Here is my interpretation of the execution plan:

1) In the CTE w step, the optimizer accessed 16101 blocks from shared_buffers and 3130 from disk.

2) In the CTE Scan on w b step, the work_mem space was insufficient, spilling 2,929 blocks to disk (written=2929).

3) In the CTE Scan on w c step, the optimizer accessed the CTE 50 times (loops=50) and read 146,500 blocks from the temp file (50 * 2929 = 146,450). The temp file was created when processing the CTE Scan on w b step.


Take node of the 'Disk Usage: 11736kB' in the HashAggregate step, which indicates that the work_mem space was small.
I am not sure where the intermediate result set of the CTE w is stored. I guess when the intermediate result set of CTE is accessed more than twice, it is stored in work_mem area.
When I increased the work_mem to 160MB to verify my assumption, I obsered that there was no temp file I/O during the CTE Scan on W c step.(The test is left for readers as an exercise.)

Using NOT MATERIALIZED
Not Let's see how the optimizer handles the query when we use the NOT MATERIALIZED keyword:

EXPLAIN (ANALYZE, BUFFERS)
WITH W AS NOT MATERIALIZED (
  SELECT ORD_NO, ORD_DATE, ORD_STATUS_CD
    FROM ONLINE_ORDER
   WHERE ORD_NO IS NOT NULL
 )
SELECT A.ORD_NO
     , A.ORD_DATE
     , A.ORD_STATUS_CD
  FROM ONLINE_ORDER A
 WHERE EXISTS (SELECT 1
                 FROM W B
                WHERE A.ORD_NO = B.ORD_NO
                  AND B.ORD_STATUS_CD = '4'
               )
   AND EXISTS (SELECT 1
                 FROM W C
                WHERE A.ORD_NO = C.ORD_NO
                  AND C.ORD_DATE = '20240331'::DATE
              )
   AND A.CUST_NO BETWEEN 100 AND 200;
 
Nested Loop  (cost=1000.85..26790.76 rows=1 width=14) (actual time=50.298..52.972 rows=0 loops=1)
  Join Filter: (a.ord_no = online_order.ord_no)
  Buffers: shared hit=14393 read=5238
  ->  Gather  (cost=1000.42..26783.56 rows=1 width=22) (actual time=50.297..52.971 rows=0 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=14393 read=5238
        ->  Nested Loop  (cost=0.42..25783.46 rows=1 width=22) (actual time=44.900..44.901 rows=0 loops=3)
              Buffers: shared hit=14393 read=5238
              ->  Parallel Seq Scan on online_order a  (cost=0.00..25481.00 rows=36 width=14) (actual time=28.290..44.846 rows=33 loops=3)
                    Filter: ((cust_no >= 100) AND (cust_no <= 200))
                    Rows Removed by Filter: 333300
                    Buffers: shared hit=13993 read=5238
              ->  Index Scan using online_order_u1 on online_order online_order_1  (cost=0.42..8.40 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=100)
                    Index Cond: ((ord_no = a.ord_no) AND (ord_no IS NOT NULL))
                    Filter: (ord_date = '2024-03-31'::date)
                    Rows Removed by Filter: 1
                    Buffers: shared hit=400
  ->  Index Scan using online_order_u1 on online_order  (cost=0.42..7.19 rows=1 width=8) (never executed)
        Index Cond: ((ord_no = online_order_1.ord_no) AND (ord_no IS NOT NULL))
        Filter: ((ord_status_cd)::text = '4'::text)
Planning:
  Buffers: shared hit=48
Planning Time: 0.313 ms
Execution Time: 52.994 ms

We can not observe any CTE in the plan, which indicates that the WITH subquery was inlined. When we use the NOT MATERIALIZED keyword, PostgreSQL did not create a temporary result set. Note that the block I/O decreased from (1969252+49979+148601+6145) to (14393+5238) by employing the NOT MATERIALIZED keyword.

Internally, PostgreSQL transformed the above query into the following format:


SELECT A.ORD_NO
, A.ORD_DATE
, A.ORD_STATUS_CD
FROM ONLINE_ORDER A
JOIN ONLINE_ORDER C
ON A.ORD_NO = C.ORD_NO
JOIN ONLINE_ORDER B
ON A.ORD_NO = B.ORD_NO
WHERE A.CUST_NO BETWEEN 100 AND 200
AND C.ORD_DATE = '20240331'::DATE
AND B.ORD_STATUS_CD = '4';


Conclusion
In PostgreSQL, we can control how with subqueries are processed by using MATERIALIZED or NOT MATERIALIZED, effectively mimicking Oracle's hints. By carefully choosing the right option, we can optimize query execution and reduce unnecessary I/O operations, improving performance during Oracle migrations.

   

postgresdba.com