Pagination query using an outer join

Author: 모델광 | Date: 2022-12-03 (Sat) 00:07 | Views: 207

Fixing bad queries and resolving performance problems can involve hours of research and testing, and it usually takes technical expertise and creativity to optimize a query. However, some query optimization techniques are so widely known that you can improve performance without spending much time: like a mathematical formula, you can simply memorize the fix. Paginating records across a large dataset is one of those queries that is easy to make go fast.

In this note I want to take a moment to present a pagination technique. It is probably nothing new to most readers of this bulletin board, but sometimes an old thing presented in a new way offers fresh insight or comprehension. This is the first article in a mini-series I am going to write on improving the speed of pagination queries. With this simple technique you will be showered with applause and accolades by your colleagues.


Here is a little code to create two demo tables:

create table customer (
  cust_no             numeric not null,
  cust_nm             character varying(100),
  register_date       timestamp(0),
  register_dt         varchar(8),
  cust_status_cd      varchar(1),
  register_channel_cd varchar(1),
  cust_age            numeric(3),
  active_yn           varchar(1),
  sigungu_cd          varchar(5),
  sido_cd             varchar(2)
);
insert into customer
select i, chr(65+mod(i,26))||i::text||'CUST_NM'
     , current_date - mod(i,10000)
     , to_char((current_date - mod(i,10000)),'yyyymmdd') as register_dt
     , mod(i,5)+1 as cust_status_cd
     , mod(i,3)+1 as register_channel_cd
     , trunc(random() * 100) +1 as age
     , case when mod(i,22) = 0 then 'N' else 'Y' end as active_yn
     , case when mod(i,1000) = 0 then '11007'
            when mod(i,1000) = 1 then '11006'
            when mod(i,1000) = 2 then '11005'
            when mod(i,1000) = 3 then '11004'
            when mod(i,1000) = 4 then '11003'
            when mod(i,1000) = 5 then '11002'
            else '11001' end                  as sigungu_cd
      , case when mod(i,3) = 0 then '01'
             when mod(i,3) = 1 then '02'
             when mod(i,3) = 2 then '03' end as sido_cd
  from generate_series(1,1000000) a(i);
ALTER TABLE customer ADD CONSTRAINT customer_pk
  PRIMARY KEY (cust_no);
CREATE INDEX CUSTOMER_X01 ON CUSTOMER(sigungu_cd, register_date, cust_nm);

create table online_order (
  ord_no        numeric(10,0) not null,
  cust_no       numeric       not null references customer,
  ord_date      timestamp(0)  not null,
  ord_dt        varchar(8)    not null,
  ord_status_cd varchar(1)    not null,
  comment       varchar(100)
);
insert into online_order
select i, mod(i,1000000) as cust_no
      ,current_date - mod(i,1000) as ord_date
      ,to_char((current_date - mod(i,1000)),'yyyymmdd') as ord_dt
      ,(mod(i,4) + 1) as ord_status_cd
      ,lpad('x',100,'x')
  from generate_series(1,2000000,2) a(i);
alter table online_order add constraint online_order_pk
primary key (ord_no);
CREATE INDEX ONLINE_ORDER_X01 ON ONLINE_ORDER(CUST_NO);
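
After a bulk load like this, it is worth collecting optimizer statistics manually rather than waiting for autovacuum, so that the timings you get are reproducible. A minimal step such as the following should do:

vacuum analyze customer;      -- refresh planner statistics after the bulk load
vacuum analyze online_order;  -- autovacuum would get here eventually; doing it
                              -- by hand makes the EXPLAIN timings stable now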

As an example, imagine that your application is configured to show 10 records per page.

Here is the pagination query we have to make go fast.

select a.cust_no, a.cust_nm, a.register_date, b.ord_date
  from customer a
  left
  join online_order b
    on a.cust_no = b.cust_no
  where a.register_date > '2021-01-01'::timestamp
order by a.cust_nm
offset 2000 fetch next 10 rows only;

The query retrieves the 201st page (rows 2001 through 2010).
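
In general, with 10 rows per page, page n maps to an offset of (n - 1) * 10, so page 201 gives offset 2000. A parameterized form of the query might look like this, where :page is a hypothetical bind variable supplied by the application:

select a.cust_no, a.cust_nm, a.register_date, b.ord_date
  from customer a
  left
  join online_order b
    on a.cust_no = b.cust_no
 where a.register_date > '2021-01-01'::timestamp
 order by a.cust_nm
offset (:page - 1) * 10 fetch next 10 rows only;  -- :page is illustrative only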

Here is the execution plan obtained by running EXPLAIN (ANALYZE, BUFFERS). To keep the explanation simple, I disabled parallel execution before running the query:

set max_parallel_workers_per_gather = 0;

Limit (actual time=446.529..446.533 rows=10 loops=1)
  Buffers: shared hit=15067 read=16706
  ->  Sort (actual time=446.392..446.472 rows=2010 loops=1)
        Sort Key: a.cust_nm
        Sort Method: top-N heapsort  Memory: 359kB
        Buffers: shared hit=15067 read=16706
        ->  Hash Right Join (actual time=139.410..428.517 rows=104200 loops=1)
              Hash Cond: (b.cust_no = a.cust_no)
              Buffers: shared hit=15067 read=16706
              ->  Seq Scan on online_order b (actual time=0.043..95.923 rows=1000000 loops=1)
                    Buffers: shared hit=7198 read=13211
              ->  Hash (actual time=139.227..139.228 rows=69500 loops=1)
                    Buckets: 131072  Batches: 1  Memory Usage: 5367kB
                    Buffers: shared hit=7869 read=3495
                    ->  Seq Scan on customer a (actual time=0.016..120.931 rows=69500 loops=1)
                          Filter: (register_date > '2021-01-01 00:00:00'::timestamp without time zone)
                          Rows Removed by Filter: 930500
                          Buffers: shared hit=7869 read=3495
Planning:
  Buffers: shared hit=16
Planning Time: 0.291 ms
Execution Time: 446.927 ms


Looking at the plan structure, we can see that the database engine joined the two tables and then sorted the intermediate result set, limiting its output to 10 rows. A "top-N heapsort" is used when you only need a few sorted rows. In the above query it maintains a heap with 10 rows, which is 359 kB in size, and consumes the input values in sequence. Once the heap is filled with 10 rows, each new value is compared against the largest value currently in the heap: if the new value is larger, it is discarded; if it is smaller, it is inserted at the appropriate point in the heap, everything gets shifted down by one, and the current largest entry is bumped off the heap.
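
The top-N strategy is tied to the row limit. If you run the same query without the OFFSET/FETCH clause, EXPLAIN should report a different Sort Method, presumably an external merge given the row count and the default work_mem:

-- Same query without the row limit: the planner can no longer use a
-- top-N heapsort, so the Sort node has to sort all ~104,200 rows
-- (typically "Sort Method: external merge" unless work_mem is large).
explain (analyze, buffers)
select a.cust_no, a.cust_nm, a.register_date, b.ord_date
  from customer a
  left
  join online_order b
    on a.cust_no = b.cust_no
 where a.register_date > '2021-01-01'::timestamp
 order by a.cust_nm;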

So PostgreSQL is not fully sorting 104,200 rows by a.cust_nm. The sorting step takes only about 18 milliseconds (446 - 428) and is therefore almost as fast as reading the data in the first place.

Even though top-N heapsort is a wonderful mechanism for optimizing sorts, the execution plan is not optimal because it reads the whole ONLINE_ORDER table. On top of that, PostgreSQL produces a large intermediate data set of 104,200 rows before retrieving just 10 of them. In the absence of an index on the REGISTER_DATE column, PostgreSQL does a full table scan on CUSTOMER. You might think that creating an index on REGISTER_DATE would make the query run blazingly fast, but using indexes comes at a cost. So let's assume we need to make the query go fast without creating any indexes.

How will we improve this query's performance and reduce resource consumption? I am going to leave this note unfinished to give readers a chance to think about how to make the query go faster without creating indexes. Presumably there is no surefire way to make the query efficient; every developer may have their own way of improving performance.

I will update this posting with my ideas some time next month.



   

postgresdba.com