KEEP DENSE_RANK 2

Author: 모델광   Date: 2025-04-19 (Sat) 11:28   Views: 79
While browsing YouTube recently, I came across the following video:
Chongqing, a multilayer city (youtube.com)

The video caught my attention - not just for the captivating cityscape but for the content creator's well-done impersonation of the most famous man in the world.
That bit of creative mimicry unexpectedly reminded me of a different kind of impersonation: how PostgreSQL can imitate certain Oracle SQL features that it does not natively support.

This article echoes an optimization technique that I originally described in 2021.
https://www.postgresdba.com/bbs/board.php?bo_table=C05&wr_id=133&sca=&sfl=wr_subject%7C%7Cwr_content&stx=keep&sop=and&page=2

This note is probably nothing new to most readers of this bulletin board. However,
sometimes an old technique presented in a new context can offer fresh insights or deepen understanding. In a migration project where I had to port Oracle queries to PostgreSQL, I encountered an SQL statement of the following format:

--The script to create the test table is at the bottom of this blog post.
select sigungu_cd
     , count(*)
     , max(register_date) as max_reg_date
     , max(cust_age) keep (dense_rank first order by register_date desc)
  from customer
 group by sigungu_cd
 having count(*) > 1;

We can infer that the business requirement would be as follows:
We should retrieve customer statistics per SIGUNGU_CD including:
1) total number of customers in each region
2) the most recent registration date per region
3) the age of the most recently registered customer

A simple requirement was solved with a simple SQL statement in Oracle.

The problem is that PostgreSQL does not directly support the KEEP (DENSE_RANK FIRST ORDER BY ...) clause, but we can achieve the same result using a couple of PostgreSQL features without performance degradation.

--Option 1: Use FIRST_VALUE() in a CTE
explain(analyze, buffers, costs off)
with w as (
select sigungu_cd
     , register_date
     , cust_age
  --   , max(register_date) as max_reg_date
     , first_value(cust_age) over (partition by sigungu_cd order by register_date desc, cust_age desc) as srd
  from customer
)
select sigungu_cd
     , count(*)
     , max(register_date) as max_reg_date
     , max(srd)
  from w
 group by sigungu_cd;

GroupAggregate (actual time=3755.010..3757.793 rows=7 loops=1)
  Group Key: customer.sigungu_cd
  Buffers: shared hit=11363, temp read=3773644 written=11492
  ->  WindowAgg (actual time=1043.202..3655.548 rows=1000000 loops=1)
        Buffers: shared hit=11363, temp read=3773644 written=11492
        ->  Sort (actual time=1043.189..1154.233 rows=1000000 loops=1)
              Sort Key: customer.sigungu_cd, customer.register_date DESC, customer.cust_age DESC
              Sort Method: external merge  Disk: 29448kB
              Buffers: shared hit=11363, temp read=7351 written=7366
              ->  Seq Scan on customer (actual time=0.011..82.678 rows=1000000 loops=1)
                    Buffers: shared hit=11363
Planning Time: 0.110 ms
Execution Time: 3760.447 ms


--Option 2: Use an ARRAY to Aggregate
explain(analyze, buffers, costs off)
select sigungu_cd
     , count(*)
     , max(register_date) as max_reg_date
     , (max(array[register_date::text,lpad(cust_age::text,3,'0')]))[2]
  from customer
 group by sigungu_cd
 order by sigungu_cd;

Sort (actual time=352.932..352.933 rows=7 loops=1)
  Sort Key: sigungu_cd
  Sort Method: quicksort  Memory: 25kB
  Buffers: shared hit=11363
  ->  HashAggregate (actual time=352.969..352.970 rows=7 loops=1)
        Group Key: sigungu_cd
        Batches: 1  Memory Usage: 24kB
        Buffers: shared hit=11363
        ->  Seq Scan on customer (actual time=0.008..42.060 rows=1000000 loops=1)
              Buffers: shared hit=11363
Planning Time: 0.063 ms
Execution Time: 353.008 ms


Note that the optimizer chose HashAggregate, and the work_mem usage is only 24 kB. The elapsed time is around 353 ms.
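The LPAD in Option 2 deserves a comment: the array elements are compared as text, and text comparison is lexicographic, not numeric. Without zero-padding, a single-digit age would sort above a three-digit one when two rows tie on register_date. A minimal illustration of the pitfall:

```sql
-- Text comparison is lexicographic: '9' sorts after '100'.
select '9' > '100' as unpadded;                            -- true (wrong for numbers)
select lpad('9', 3, '0') > lpad('100', 3, '0') as padded;  -- '009' > '100' is false (correct)
```

Similarly, the register_date::text cast only sorts correctly because the yyyy-mm-dd hh24:mi:ss text form happens to sort chronologically.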

--Option 3: Use DISTINCT ON
explain(analyze, buffers, costs off)
select distinct on (sigungu_cd) sigungu_cd
     , count(*) over (partition by sigungu_cd) as sigungu_cd_cnt
     , register_date as max_register_date
     , cust_age
  from customer
 --group by sigungu_cd
 order by sigungu_cd, register_date desc, cust_age desc;


DISTINCT ON (sigungu_cd) ensures one row per sigungu_cd, keeping the row with the latest register_date.
count(*) over (partition by sigungu_cd) computes the total rows per sigungu_cd without collapsing rows via GROUP BY.
order by sigungu_cd, register_date desc, cust_age desc ensures the latest row is the one selected.

Unique (actual time=1016.306..1153.615 rows=7 loops=1)
  Buffers: shared hit=11363, temp read=23686 written=20268
  ->  Incremental Sort (actual time=1016.304..1110.089 rows=1000000 loops=1)
        Sort Key: sigungu_cd, register_date DESC, cust_age DESC
        Presorted Key: sigungu_cd
        Full-sort Groups: 7  Sort Method: quicksort  Average Memory: 28kB  Peak Memory: 28kB
        Pre-sorted Groups: 7  Sort Method: external merge  Average Disk: 36984kB  Peak Disk: 36984kB
        Buffers: shared hit=11363, temp read=23686 written=20268
        ->  WindowAgg (actual time=433.885..544.341 rows=1000000 loops=1)
              Buffers: shared hit=11363, temp read=14442 written=11007
              ->  Sort (actual time=243.128..307.002 rows=1000000 loops=1)
                    Sort Key: sigungu_cd
                    Sort Method: external merge  Disk: 29448kB
                    Buffers: shared hit=11363, temp read=7351 written=7366
                    ->  Seq Scan on customer (actual time=0.009..79.164 rows=1000000 loops=1)
                          Buffers: shared hit=11363
Planning Time: 0.043 ms
Execution Time: 1155.764 ms

We observe that the optimizer used Sort, WindowAgg, Incremental Sort, and Unique nodes. The work_mem area was not sufficient, so the sorts spilled to disk (Sort Method: external merge). It took around 1155 ms to run the query.
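If we wanted to keep those sorts in memory, one option is to raise work_mem for the session before running the query. The value below is only an illustration; the appropriate setting depends on the actual spill size (around 37 MB of disk was used here):

```sql
-- Session-level only; size it to the observed external-merge disk usage.
set work_mem = '64MB';
```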

In this test case, the ARRAY method proved more efficient than DISTINCT ON, consuming less memory and requiring fewer temp writes. However, performance can vary depending on data distribution and size, so both methods are worth considering.
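For completeness, PostgreSQL's aggregate-level ORDER BY offers yet another way to mimic KEEP: array_agg can order the rows inside the aggregate and we then take the first element. This is a sketch I did not benchmark here; note that it materializes the whole array per group, so on large groups it can use noticeably more memory than the ARRAY method in Option 2:

```sql
--Sketch: ORDER BY inside the aggregate emulates KEEP (DENSE_RANK FIRST ...)
select sigungu_cd
     , count(*)
     , max(register_date) as max_reg_date
     , (array_agg(cust_age order by register_date desc, cust_age desc))[1] as srd
  from customer
 group by sigungu_cd
having count(*) > 1;
```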

Conclusion
When we are doing aggregation, one of the most common requirements is this: once we have aggregated on a particular column, we also want information about another column from the row that was selected. Requirements of this kind can trap SQL novices.

In Oracle, the KEEP (DENSE_RANK FIRST ..) syntax makes this kind of query both elegant and concise. In PostgreSQL, while there's no direct equivalent, we can replicate the logic using ARRAY, FIRST_VALUE(), or DISTINCT ON functionalities.

Footnote
The following is the script to create the test table:

create table customer (
  cust_no             int not null,
  cust_nm             character varying(100),
  register_date       timestamp(0),
  register_dt         varchar(8),
  cust_status_cd      varchar(1),
  register_channel_cd varchar(1),
  cust_age            int,
  active_yn           boolean,
  sigungu_cd          varchar(5),
  sido_cd             varchar(2)
);
insert into customer
select i, chr(65+mod(i,26))||i::text||'CUST_NM'
     , '2024-06-08'::date - mod(i,10000)
     , to_char(('2024-06-08'::date - mod(i,10000)),'yyyymmdd') as register_dt
     , mod(i,5)+1 as cust_status_cd
     , mod(i,3)+1 as register_channel_cd
     , trunc(random() * 100) +1 as age
     , case when mod(i,22) = 0 then false else true end as active_yn
     , case when mod(i,1000) = 0 then '11007'
            when mod(i,1000) = 1 then '11006'
            when mod(i,1000) = 2 then '11005'
            when mod(i,1000) = 3 then '11004'
            when mod(i,1000) = 4 then '11003'
            when mod(i,1000) = 5 then '11002'
            else '11001' end                  as sigungu_cd
      , case when mod(i,3) = 0 then '01'
             when mod(i,3) = 1 then '02'
             when mod(i,3) = 2 then '03' end as sido_cd
  from generate_series(1,1000000) a(i);

ALTER TABLE customer ADD CONSTRAINT customer_pk  PRIMARY KEY (cust_no);

 

postgresdba.com