Boundary Condition Issue

Author: 모델광   Date: 2024-03-17 (Sun) 22:32   Views: 382
During a database migration project from Oracle to PostgreSQL, I frequently encounter performance issues caused by differences in optimizer functionality between the two databases. Here is an insight from my migration experience, focusing on how these databases handle query plans differently.

Let's dive into a practical scenario with a table named "t1", holding records across 50 random days starting from November 1st, 2023, and see how PostgreSQL's query plans evolve over time compared with Oracle's approach.

Below is the script I have created to illustrate my point. I have conducted the test on PostgreSQL 15.1:

CREATE TABLE t1 AS
SELECT '20231101'::date + round((50*random())::int)::int as reg_dt
FROM generate_series(1,1000000);

ANALYZE t1;

SELECT reg_dt, count(*)
FROM t1
GROUP BY reg_dt;

reg_dt    |count|
----------+-----+
2023-11-01| 9895|
2023-11-02|20194|
2023-11-03|19842|
2023-11-04|20155|
.......
2023-12-16|19902|
2023-12-17|20103|
2023-12-18|20062|
2023-12-19|19755|
2023-12-20|20038|
2023-12-21|10155|

This setup scatters 1,000,000 rows over 50 days, averaging around 20,000 rows per day from November 1st to December 21st, 2023. (The first and last days get roughly half that, because rounding assigns only half an interval to each endpoint.)
Let's assume that as time goes by, about 20,000 rows are added per day. Today is the 21st of December, and we query for tuples from that day onwards:

SELECT *
FROM t1
WHERE reg_dt >= '20231221'::date;

And here is the execution plan. Take note of the estimated number of rows.

Gather  (cost=1000.00..11718.33 rows=10300 width=4)                  
  Workers Planned: 2                                                 
  ->  Parallel Seq Scan on t1  (cost=0.00..9688.33 rows=4292 width=4)
        Filter: (reg_dt >= '2023-12-21'::date) 

The actual count stands at 10,155, yet the optimizer predicted we would fetch 10,300 rows. You might wonder, how did it come up with 10,300?

select b.*
  from pg_stats a cross join lateral
       unnest(a.most_common_vals::text::text[], a.most_common_freqs::text::numeric[]) as b(mcv, mcf)
 where tablename = 't1'
order by b.mcv;


mcv       |mcf        
----------+-----------
2023-11-01|0.010066667
2023-11-02|0.019833334
2023-11-03|0.019266667
....
....
2023-12-19|0.020466667
2023-12-20|0.021766666
2023-12-21|0.0103



The planner simply multiplied the most common frequency for 2023-12-21 by the total number of rows:

1,000,000 * 0.0103 = 10,300
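This arithmetic can be reproduced outside the database. A minimal sketch, plugging in the row count and the MCV frequency shown above:

```python
# The planner's MCV-based estimate:
# estimated rows = total row count (pg_class.reltuples) * MCV frequency
reltuples = 1_000_000   # rows in t1 after the initial load
mcf_dec21 = 0.0103      # most_common_freqs entry for 2023-12-21

estimate = round(reltuples * mcf_dec21)
print(estimate)         # matches rows=10300 in the plan above
```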

So far, so good. Let's suppose that the following day, 2023-12-22, we inserted an additional 10,000 rows.

INSERT INTO t1
SELECT '20231222'::date as reg_dt
FROM generate_series(1,10000);

And on the 22nd of December, we run the following query:

select *
  from t1
 where reg_dt >= '20231222'::date;

How do you suppose the optimizer will estimate the number of rows now? Can it recognize the addition of 10,000 new rows?

Let's examine the resulting execution plan:

Gather  (cost=1000.00..10688.43 rows=1 width=4)                   
  Workers Planned: 2                                              
  ->  Parallel Seq Scan on t1  (cost=0.00..9688.33 rows=1 width=4)
        Filter: (reg_dt >= '2023-12-22'::date) 

To our disappointment, the planner estimates that the query will return just one row, a misestimate that can lead to poor plan choices downstream. This discrepancy stems from the lack of updated statistics for the new date.
In Oracle, when you supply a predicate that lies above the upper bound or below the lower bound of the known column statistics, the optimizer applies a scaling-down algorithm. If the optimizer knows that the highest value is the 21st of December and you query for the 22nd, it estimates it will retrieve around 90 % of the row count of the highest value; for the 23rd, 80 %; and so on, scaling down to one row as the queried value moves further and further beyond the highest value the optimizer knows about. To mitigate this issue further, Oracle offers a real-time statistics feature on Exadata: as DML runs, the database tracks the changes, and from time to time Oracle updates the lowest and highest values for various columns. This is not a full gathering of statistics.
In PostgreSQL, by contrast, if the predicate goes above the known boundary, the planner simply assumes one row, a stark contrast to Oracle's approach.
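The contrast between the two behaviors can be sketched as follows. This is illustrative only, not the engines' actual code; the function names and the 10 %-per-day decay rate are assumptions chosen to match the description above:

```python
def oracle_out_of_range_estimate(days_beyond, rows_at_max, decay=0.10):
    """Illustrative: scale the estimate down linearly the further
    the predicate lies beyond the highest value known to the stats."""
    if days_beyond <= 0:
        return rows_at_max                      # value within known range
    factor = max(0.0, 1.0 - decay * days_beyond)
    return max(1, round(rows_at_max * factor))  # floor at one row

def postgres_out_of_range_estimate(days_beyond, rows_at_max):
    """Illustrative: past the known boundary, assume a single row."""
    return rows_at_max if days_beyond <= 0 else 1

# Highest value in the stats is 2023-12-21 with ~10,300 estimated rows.
for d in range(4):
    print(d, oracle_out_of_range_estimate(d, 10300),
          postgres_out_of_range_estimate(d, 10300))
```

The Oracle-style estimate degrades gracefully (9,270 rows one day out, 8,240 two days out, ...), while the PostgreSQL-style estimate drops straight to one row.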

Conclusion
Diverging from Oracle, PostgreSQL (as of version 15.1) significantly misestimates the number of rows for boundary conditions. You can employ the following strategies to mitigate this issue:

1. Frequent Statistics Gathering
    Lower the table-level statistics-gathering thresholds. By default, roughly 10 % of a table's rows must change (autovacuum_analyze_scale_factor = 0.1) before autovacuum re-analyzes the table.
2. Fixing the Execution Plan with Hints
    You can pin the execution plan with hints (in PostgreSQL this requires an extension such as pg_hint_plan). With this strategy you will always get a consistent plan. However, bear in mind that a consistent plan does not guarantee consistent performance: hinting trades peak performance for predictability, avoiding sudden plan flips at the cost of performance that may degrade slowly over time.
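For strategy 1, the threshold can be lowered per table with a storage parameter; the 0.01 value below is only an example and should be tuned to the table's churn:

```sql
-- Trigger an automatic ANALYZE after roughly 1 % of t1 changes,
-- instead of the default 10 % (autovacuum_analyze_scale_factor = 0.1).
ALTER TABLE t1 SET (autovacuum_analyze_scale_factor = 0.01);

-- Alternatively, refresh the statistics manually right after each daily load:
ANALYZE t1;
```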

   

postgresdba.com