Fixing bad queries and resolving
performance problems can involve hours of research and testing, and
optimizing a query usually requires technical expertise and creativity.
However, some query optimization techniques are so widely known that you
can improve query performance without spending much time: just as you
can memorize a mathematical formula, you can memorize how to fix a query.
Paginating records across a large dataset is one of those cases
that are easy to make fast.
In this note I want to take a moment to share a pagination technique.
It's probably nothing new to most readers of this bulletin board, but
sometimes an old thing presented in a new way offers fresh insight or
understanding. This is the first article of a mini-series I am going
to write on improving the speed of pagination queries. With this simple
technique you will be showered with applause and accolades by your
colleagues.
Here is a little code to create two demo tables:
create table customer (
cust_no numeric not null,
cust_nm character varying(100),
register_date timestamp(0),
register_dt varchar(8),
cust_status_cd varchar(1),
register_channel_cd varchar(1),
cust_age numeric(3),
active_yn varchar(1),
sigungu_cd varchar(5),
sido_cd varchar(2)
);
insert into customer
select i, chr(65+mod(i,26))||i::text||'CUST_NM'
, current_date - mod(i,10000)
, to_char((current_date - mod(i,10000)),'yyyymmdd') as register_dt
, mod(i,5)+1 as cust_status_cd
, mod(i,3)+1 as register_channel_cd
, trunc(random() * 100) +1 as age
, case when mod(i,22) = 0 then 'N' else 'Y' end as active_yn
, case when mod(i,1000) = 0 then '11007'
when mod(i,1000) = 1 then '11006'
when mod(i,1000) = 2 then '11005'
when mod(i,1000) = 3 then '11004'
when mod(i,1000) = 4 then '11003'
when mod(i,1000) = 5 then '11002'
else '11001' end as sigungu_cd
, case when mod(i,3) = 0 then '01'
when mod(i,3) = 1 then '02'
when mod(i,3) = 2 then '03' end as sido_cd
from generate_series(1,1000000) a(i);
ALTER TABLE customer ADD CONSTRAINT customer_pk
PRIMARY KEY (cust_no);
CREATE INDEX CUSTOMER_X01 ON CUSTOMER(sigungu_cd, register_date, cust_nm);
create table online_order (
ord_no numeric(10,0) not null,
cust_no numeric not null references customer,
ord_date timestamp(0) not null,
ord_dt varchar(8) not null,
ord_status_cd varchar(1) not null,
comment varchar(100)
);
insert into online_order
select i, mod(i,1000000) as cust_no
,current_date - mod(i,1000) as ord_date
,to_char((current_date - mod(i,1000)),'yyyymmdd') as ord_dt
,(mod(i,4) + 1) as ord_status_cd
,lpad('x',100,'x')
from generate_series(1,2000000,2) a(i);
alter table online_order add constraint online_order_pk
primary key (ord_no);
CREATE INDEX ONLINE_ORDER_X01 ON ONLINE_ORDER(CUST_NO);
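If you are following along, it is worth refreshing the planner statistics after loading the data; the plans below assume reasonably fresh statistics (autovacuum would eventually gather them on its own):
vacuum analyze customer;
vacuum analyze online_order;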
As an example, imagine that your application is configured to show 10 records per page.
Here is the pagination query we have to make go fast.
select a.cust_no, a.cust_nm, a.register_date, b.ord_date
from customer a
left
join online_order b
on a.cust_no = b.cust_no
where a.register_date > '2021-01-01'::timestamp
order by a.cust_nm
offset 2000 fetch next 10 rows only;
The query retrieves the 201st page.
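In general, page n with a page size of s starts at offset (n - 1) * s, so the 201st page maps to offset (201 - 1) * 10 = 2000. Here is a parameterized sketch of the same query, where :page and :page_size are placeholder bind-parameter names rather than anything in the demo schema (PostgreSQL wants the parentheses around non-trivial expressions in OFFSET ... ROWS and FETCH):
select a.cust_no, a.cust_nm, a.register_date, b.ord_date
from customer a
left
join online_order b
on a.cust_no = b.cust_no
where a.register_date > '2021-01-01'::timestamp
order by a.cust_nm
offset ((:page - 1) * :page_size) rows
fetch next (:page_size) rows only;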
Here is the execution plan obtained by running EXPLAIN (ANALYZE, BUFFERS).
For simplicity of explanation, I disabled parallel query execution
before running it:
set max_parallel_workers_per_gather = 0;
Limit (actual time=446.529..446.533 rows=10 loops=1)
Buffers: shared hit=15067 read=16706
-> Sort (actual time=446.392..446.472 rows=2010 loops=1)
Sort Key: a.cust_nm
Sort Method: top-N heapsort Memory: 359kB
Buffers: shared hit=15067 read=16706
-> Hash Right Join (actual time=139.410..428.517 rows=104200 loops=1)
Hash Cond: (b.cust_no = a.cust_no)
Buffers: shared hit=15067 read=16706
-> Seq Scan on online_order b (actual time=0.043..95.923 rows=1000000 loops=1)
Buffers: shared hit=7198 read=13211
-> Hash (actual time=139.227..139.228 rows=69500 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 5367kB
Buffers: shared hit=7869 read=3495
-> Seq Scan on customer a (actual time=0.016..120.931 rows=69500 loops=1)
Filter: (register_date > '2021-01-01 00:00:00'::timestamp without time zone)
Rows Removed by Filter: 930500
Buffers: shared hit=7869 read=3495
Planning:
Buffers: shared hit=16
Planning Time: 0.291 ms
Execution Time: 446.927 ms
Looking
at the plan structure we can see that the database engine joined the two
tables, then sorted the intermediate result set, limiting its output to 10
rows. "top-N heapsort" is used when only a small number of sorted rows is
wanted. In the above query it maintains a heap of 10 rows (the sort
reports 359kB of memory) and consumes the input values in sequence. Once
the heap holds 10 rows, each new value is examined against the current
entries. If it sorts after all 10 of them, it is discarded. Otherwise it
is inserted at the appropriate point in the heap, everything after it
shifts down by one, and the last entry is bumped off the heap.
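If you want to see the effect yourself, compare the Sort Method that EXPLAIN reports with and without a small LIMIT; without one, the planner has no choice but to sort everything (the exact method and timings will of course vary on your system):
explain (analyze, buffers)
select cust_nm from customer order by cust_nm limit 10;  -- Sort Method: top-N heapsort
explain (analyze, buffers)
select cust_nm from customer order by cust_nm;           -- full sort, e.g. quicksort or external merge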
Thanks to the top-N heapsort, PostgreSQL
does not have to keep all 104200 rows in memory and fully sort them by
a.cust_nm. The sorting step takes only about 18 milliseconds (446 - 428)
and is therefore almost as fast as reading the data in the first place.
Even though top-N heapsort is a wonderful mechanism for optimizing sorts,
the execution plan is not optimal, because it reads the whole
ONLINE_ORDER table. On top of that, PostgreSQL produces a large
intermediate data set of 104200 rows before retrieving just 10 of them. In
the absence of an index on the column REGISTER_DATE, PostgreSQL does a
full table scan on CUSTOMER. You might think that creating an index on
REGISTER_DATE would make the query run blazingly fast, but using indexes
comes at a cost. So let's assume we need to make the query fast without
creating any indexes. How can we improve the query's performance and
reduce its resource consumption? I am going to leave this note unfinished
for now, to give readers a little chance to think about how to make the
query faster without creating indexes.
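(For reference, the index alluded to above would look something like the statement below. The name is made up, and per the constraint we just set, we will not actually create it.)
-- illustrative only, not created in this article
create index customer_x02 on customer (register_date);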
There is probably no surefire way to make the query efficient; every
developer may have their own way of improving performance.
I will update this post with my ideas some time next month.
Added on Feb 5, 2023
There is a small question which often arises when making a query go fast:
how can I reduce the amount of data being joined? "An old man's outdated trickery" can help. To get around the inefficiency of the query, I rewrote it as follows:
(If you are a novice developer, you might wonder how I came up with this query. Before you shower me with applause and accolades, please note that the methodology here is not mine. Most intermediate-level developers would write this query without much thought. I am afraid I cannot provide a logical explanation for it; you need to practice until you can rely on experience and muscle memory.)
SELECT A.CUST_NO, A.CUST_NM, A.REGISTER_DATE, B.ORD_DATE
FROM (SELECT A.CUST_NO, A.CUST_NM, A.REGISTER_DATE
FROM CUSTOMER A
WHERE A.REGISTER_DATE > '2021-01-01'::TIMESTAMP
ORDER BY A.CUST_NM
FETCH NEXT 2010 ROWS ONLY) A
LEFT
JOIN ONLINE_ORDER B
ON A.CUST_NO = B.CUST_NO
ORDER BY CUST_NM
OFFSET 2000 FETCH NEXT 10 ROWS ONLY;
Let's dig into the execution plan and see what happens:
Limit (actual time=69.194..69.199 rows=10 loops=1)
Buffers: shared hit=3546 read=11203
-> Nested Loop Left Join (actual time=67.935..69.134 rows=2010 loops=1)
Buffers: shared hit=3546 read=11203
-> Limit (actual time=67.915..67.988 rows=652 loops=1)
Buffers: shared hit=160 read=11203
-> Sort (actual time=67.914..67.949 rows=652 loops=1)
Sort Key: a.cust_nm
Sort Method: top-N heapsort Memory: 360kB
Buffers: shared hit=160 read=11203
-> Seq Scan on customer a (actual time=0.057..58.188 rows=71800 loops=1)
Filter: (register_date > '2021-01-01 00:00:00'::timestamp without time zone)
Rows Removed by Filter: 928200
Buffers: shared hit=160 read=11203
-> Index Scan using online_order_x01 on online_order b (actual time=0.001..0.001 rows=2 loops=652)
Index Cond: (cust_no = a.cust_no)
Buffers: shared hit=3386
Planning:
Buffers: shared hit=8
Planning Time: 0.171 ms
Execution Time: 69.223 ms
In the rewritten query, I have deferred the join until the top rows have been fetched from the CUSTOMER table. I asked the inline view for up to 2010 rows, but the outer Limit was satisfied after only 652 of them: in this run each qualifying customer matched about three order rows on average, so 652 customers were enough to produce the 2010 joined rows needed to deliver rows 2001 through 2010. Because only 652 tuples were pulled from the CUSTOMER side, the planner was able to perform a nested loop join, probing the ONLINE_ORDER_X01 index instead of reading the entire ONLINE_ORDER table.
This reduces the number of block I/Os from 31773 (15067 hit + 16706 read) to 14749 (3546 hit + 11203 read), and the query runs about six times faster than the original (69 ms versus 447 ms). Additionally, since the rewritten query touches fewer buffers, it leaves more room in the buffer pool, minimizing the risk of buffer contention and freeing up space for other tasks.
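The same "deferred join" shape can be reused whenever a pagination query drags a large detail table through the sort. Here is a generic sketch; driving_table, detail_table, and the column names are placeholders, not part of the demo schema:
select d.key_col, d.sort_col, b.detail_col
from (select key_col, sort_col           -- keep only what the sort and join need
      from driving_table
      where filter_condition             -- your WHERE clause goes here
      order by sort_col
      fetch next 2010 rows only) d       -- offset + page size rows are enough
left
join detail_table b
on b.key_col = d.key_col
order by d.sort_col
offset 2000 fetch next 10 rows only;
Note that the inline view must fetch at least offset + page-size rows (2000 + 10 here); otherwise the outer OFFSET would run out of rows to skip.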
Conclusion
There is not a one-size-fits-all solution to optimizing a pagination query. Here is the principle I applied to improve the original query:
- Reduce the number of records participating in join operations.