The video caught my attention, not just for the captivating cityscape but for the content creator's well-done impersonation of the most famous man in the world.
That bit of creative mimicry unexpectedly reminded me of a different kind of impersonation: how PostgreSQL can imitate certain Oracle SQL features that it does not natively support.
This article revisits an optimization technique that I originally described in 2021: https://www.postgresdba.com/bbs/board.php?bo_table=C05&wr_id=133&sca=&sfl=wr_subject%7C%7Cwr_content&stx=keep&sop=and&page=2
This note is probably nothing new to most readers of this bulletin board. However, sometimes an old idea presented in a new context can offer fresh insight or improve understanding. In a migration project where I had to port Oracle queries to PostgreSQL, I encountered a SQL statement of the following form:
--The script to create the test table is at the bottom of this blog post.
select sigungu_cd
     , count(*)
     , max(register_date) as max_reg_date
     , max(cust_age) keep (dense_rank first order by register_date desc)
  from customer
 group by sigungu_cd
having count(*) > 1;
We can infer that the business requirement is as follows: retrieve customer statistics per SIGUNGU_CD, including:
1) the total number of customers in each region
2) the most recent registration date per region
3) the age of the most recently registered customer
A simple requirement was solved with a simple SQL statement in Oracle.
The problem is that PostgreSQL does not directly support the KEEP (DENSE_RANK FIRST ORDER BY ...) clause, but we can achieve the same result using a couple of PostgreSQL features without performance degradation.
--Option 1: Use FIRST_VALUE() in a CTE
explain(analyze, buffers, costs off)
with w as (
select sigungu_cd
     , register_date
     , cust_age
--   , max(register_date) as max_reg_date
     , first_value(cust_age) over (partition by sigungu_cd
                                       order by register_date desc, cust_age desc) as srd
  from customer
)
select sigungu_cd
     , count(*)
     , max(register_date) as max_reg_date
     , max(srd)
  from w
 group by sigungu_cd;
--Option 2: Use an ARRAY to Aggregate
explain(analyze, buffers, costs off)
select sigungu_cd
     , count(*)
     , max(register_date) as max_reg_date
     , (max(array[register_date::text, lpad(cust_age::text,3,'0')]))[2]
  from customer
 group by sigungu_cd
 order by sigungu_cd;
  Buffers: shared hit=11363
  ->  HashAggregate (actual time=352.969..352.970 rows=7 loops=1)
        Group Key: sigungu_cd
        Batches: 1  Memory Usage: 24kB
        Buffers: shared hit=11363
        ->  Seq Scan on customer (actual time=0.008..42.060 rows=1000000 loops=1)
              Buffers: shared hit=11363
Planning Time: 0.063 ms
Execution Time: 353.008 ms
Note that the optimizer chose HashAggregate, the work_mem usage is only 24 kB, and the elapsed time is around 353 ms.
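A detail worth calling out in the ARRAY method: PostgreSQL compares arrays element by element, and the elements here are text, so they sort lexicographically. That is why cust_age is zero-padded with lpad(); without the padding, an age of 9 would sort after an age of 10. A quick illustration:

select '9' > '10';                              -- true: text comparison is lexicographic
select array['9'] > array['10'];                -- true: arrays compare element by element as text
select lpad('9',3,'0') > lpad('10',3,'0');      -- false: '009' < '010', numeric order restored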
--Option 3: Use DISTINCT ON
explain(analyze, buffers, costs off)
select distinct on (sigungu_cd)
       sigungu_cd
     , count(*) over (partition by sigungu_cd) as sigungu_cd_cnt
     , register_date as max_register_date
     , cust_age
  from customer
--group by sigungu_cd
 order by sigungu_cd, register_date desc, cust_age desc;

DISTINCT ON (sigungu_cd) ensures one row per sigungu_cd, taking the last register_date.
count(*) over (partition by sigungu_cd) computes the total rows per sigungu_cd without aggregation issues.
order by sigungu_cd, register_date desc, cust_age desc ensures the latest row is selected.
We observe that the optimizer used Sort, WindowAgg, Incremental Sort, and Unique nodes. The work_mem area was not sufficient, so the sort spilled to disk (Sort Method: external merge). The query took around 1155 ms to run.
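If you want to check whether the spill is purely a work_mem issue, one option is to raise work_mem for the current session and re-run the query. The value below is only an illustrative guess, not a tuned recommendation:

show work_mem;            -- inspect the current setting
set work_mem = '256MB';   -- session-level change; illustrative value only
-- ... re-run the DISTINCT ON query here ...
reset work_mem;           -- restore the default afterwards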
In this test case, the ARRAY method proved more efficient than DISTINCT ON, consuming less memory and requiring fewer temp writes. However, performance can vary depending on data distribution and size, so both methods are worth considering.
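Since the ARRAY and FIRST_VALUE() rewrites should be interchangeable, it is also worth confirming that they return the same rows. Here is a minimal cross-check, assuming the test table from the footnote (the query below is mine, not part of the original benchmark):

with opt1 as (
  select sigungu_cd, max(srd) as last_cust_age
    from (select sigungu_cd
               , first_value(cust_age) over (partition by sigungu_cd
                                                 order by register_date desc, cust_age desc) as srd
            from customer) w
   group by sigungu_cd
), opt2 as (
  select sigungu_cd
       , ((max(array[register_date::text, lpad(cust_age::text,3,'0')]))[2])::int as last_cust_age
    from customer
   group by sigungu_cd
)
select * from opt2
except
select * from opt1;   -- an empty result means both rewrites agree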
Conclusion
When we are doing aggregation, one of the most common requirements is that, once we have aggregated on a particular column, we also want information from another column of the same rows. These kinds of requirements may trap SQL novices.
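For illustration, here is the kind of naive attempt that trips people up (a hypothetical example, not from the original post): selecting cust_age alongside a GROUP BY on sigungu_cd fails, because cust_age is neither grouped nor aggregated.

-- Fails: cust_age must appear in the GROUP BY clause or be used in an aggregate function
select sigungu_cd
     , max(register_date) as max_reg_date
     , cust_age
  from customer
 group by sigungu_cd;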
In Oracle, the KEEP (DENSE_RANK FIRST ...) syntax makes this kind of query both elegant and concise. In PostgreSQL, while there is no direct equivalent, we can replicate the logic using ARRAY aggregation, FIRST_VALUE(), or DISTINCT ON.
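For completeness, one way to put the pieces back together as a drop-in replacement for the original Oracle statement is the ARRAY method with the HAVING filter restored. This is only a sketch; the FIRST_VALUE() and DISTINCT ON variants can be adapted the same way, and the final cast to int simply mirrors Oracle's numeric result:

select sigungu_cd
     , count(*)
     , max(register_date) as max_reg_date
     , ((max(array[register_date::text, lpad(cust_age::text,3,'0')]))[2])::int as last_cust_age
  from customer
 group by sigungu_cd
having count(*) > 1;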
Footnote
The following is the script to create the test table:
create table customer (
    cust_no             int not null,
    cust_nm             character varying(100),
    register_date       timestamp(0),
    register_dt         varchar(8),
    cust_status_cd      varchar(1),
    register_channel_cd varchar(1),
    cust_age            int,
    active_yn           boolean,
    sigungu_cd          varchar(5),
    sido_cd             varchar(2)
);

insert into customer
select i
     , chr(65+mod(i,26))||i::text||'CUST_NM'
     , '2024-06-08'::date - mod(i,10000)
     , to_char(('2024-06-08'::date - mod(i,10000)),'yyyymmdd') as register_dt
     , mod(i,5)+1 as cust_status_cd
     , mod(i,3)+1 as register_channel_cd
     , trunc(random() * 100) + 1 as age
     , case when mod(i,22) = 0 then false else true end as active_yn
     , case when mod(i,1000) = 0 then '11007'
            when mod(i,1000) = 1 then '11006'
            when mod(i,1000) = 2 then '11005'
            when mod(i,1000) = 3 then '11004'
            when mod(i,1000) = 4 then '11003'
            when mod(i,1000) = 5 then '11002'
            else '11001'
       end as sigungu_cd
     , case when mod(i,3) = 0 then '01'
            when mod(i,3) = 1 then '02'
            when mod(i,3) = 2 then '03'
       end as sido_cd
  from generate_series(1,1000000) a(i);
ALTER TABLE customer ADD CONSTRAINT customer_pk PRIMARY KEY (cust_no);
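If you reproduce the tests, you may also want to refresh planner statistics after the bulk load; the original post does not show this step, so treat it as an optional assumption:

vacuum analyze customer;   -- optional: refresh statistics after loading 1,000,000 rows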