Cardinality Estimation Rule

Author : 모델광   Date : 2021-07-06 (Tue) 21:05   Views : 1797

The shape of an execution plan is based on the estimated cardinality, and that cardinality is calculated from the statistics PostgreSQL keeps about the data.

The query planner produces different access paths, join methods, and join orders depending on the estimated cardinalities of the joining tables and on the available indexes and constraints.

If a query has a performance problem, a large gap between the estimated and actual row counts is often the first thing to investigate.
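
The easiest way to see that gap is EXPLAIN (ANALYZE), which prints the estimated and actual row counts side by side for every plan node. A minimal sketch (the table and column below are placeholders, not objects from this post):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM some_table WHERE some_col = 'A';   -- some_table/some_col are placeholders
-- each plan node then shows "(cost=... rows=<estimated> ...) (actual time=... rows=<actual> loops=...)"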


When can PostgreSQL estimate row counts accurately? Unfortunately, there are only a few cases where PostgreSQL's row-count estimates are reliable.

The following are those cases.


* Conditions like col1 = 'A' are typically well-handled.

* Similarly, conditions like col1 > 10 are usually well estimated, based on a combination of the most-common-value list and the histogram of non-MCV values. Planner error in this area usually happens because the table is changing quickly and auto-analyze falls behind.

* Conditions like col1 IS NULL are well-handled because the fraction of values that are null is one of the statistics PostgreSQL gathers (these statistics are visible in the pg_stats query shown below).
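
The per-column statistics behind all three cases can be inspected in the pg_stats view. A minimal sketch, again with placeholder table and column names:

SELECT null_frac, n_distinct, most_common_vals, most_common_freqs, histogram_bounds
  FROM pg_stats
 WHERE tablename = 'some_table' AND attname = 'col1';   -- placeholders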


For anything beyond the above cases, the PostgreSQL planner has trouble with row-count estimates. In fact, most DBMS optimizers have the same issue. As for Oracle, it uses what I call heuristic rules, such as the 1% rule and the 5% rule, among others.


I did some investigation into whether PostgreSQL has similar rules.

The following is the script I created to run some experiments.


drop table if exists t1;

drop table if exists t2;


create table t1

as

select i as id, rpad(chr(64+mod(i,26))::text,20,'x') as col2 from generate_series(1,10000) a(i);

create table t2 as select * from t1 order by id limit 1000;

analyze t1;

analyze t2;


Note that T1 has 10,000 rows and T2 has 1,000 rows.
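
One note on reproducing the plans below: some of them use an index named t1_x01, which is not created in the script above. Judging from the plans, I assume it is a plain index on t1(id), created with something like this:

create index t1_x01 on t1 (id);   -- assumed definition, based on the index conditions shown later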


SELECT *

  FROM T1

 WHERE ID BETWEEN (SELECT 1) AND (SELECT 10);

 Seq Scan on t1  (cost=0.02..224.02 rows=50 width=25)

   Filter: ((id >= $0) AND (id <= $1))

   InitPlan 1 (returns $0)

     ->  Result  (cost=0.00..0.01 rows=1 width=4)

   InitPlan 2 (returns $1)

     ->  Result  (cost=0.00..0.01 rows=1 width=4)


SELECT *

  FROM T1

 WHERE SUBSTR(COL2,10,2)='A';

 Seq Scan on t1  (cost=0.00..224.00 rows=50 width=25)

   Filter: (substr(col2, 10, 2) = 'A'::text)


SELECT *

  FROM T1

 WHERE id +1 = 2;


 Seq Scan on t1  (cost=0.00..224.00 rows=50 width=25)

   Filter: ((id + 1) = 2)


SELECT *

  FROM T1

 WHERE id::text = '1';


 Seq Scan on t1  (cost=0.00..249.00 rows=50 width=25)

   Filter: ((id)::text = '1'::text)


0.5% RULE

When the column in a predicate is manipulated (wrapped in a function, used in an expression, or cast to another type), or when the comparison values are not known at planning time (as in the BETWEEN example with subquery bounds), the planner cannot derive a selectivity from the column statistics and falls back to a default. In all of the above cases the estimated row count is 50, which is 0.5% of t1's 10,000 rows (10,000 * 0.005 = 50). I would call this the "0.5% RULE".
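
Incidentally, for the manipulated-column cases there is a way to give the planner real statistics: an expression index, whose expression values are analyzed just like a regular column. A sketch under that assumption (the index name below is mine):

create index t1_substr_x01 on t1 (substr(col2, 10, 2));   -- hypothetical index name
analyze t1;
-- After this, EXPLAIN SELECT * FROM T1 WHERE SUBSTR(COL2,10,2)='A'
-- should produce an estimate based on the expression's own statistics
-- rather than the 0.5% default.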


SELECT *

  FROM T1 A

 WHERE NOT EXISTS (SELECT 1

                            FROM T2 B

                           WHERE A.ID = B.ID

                               and b.id = 3

                           OFFSET 0);

 Seq Scan on t1 a  (cost=0.00..205199.00 rows=5000 width=25)

   Filter: (NOT (SubPlan 1))

   SubPlan 1

     ->  Result  (cost=0.00..20.50 rows=1 width=4)

           One-Time Filter: (a.id = 3)

           ->  Seq Scan on t2 b  (cost=0.00..20.50 rows=1 width=0)

                 Filter: (id = 3)


SELECT *

  FROM T1 A

 WHERE EXISTS (SELECT 1

                            FROM T2 B

                           WHERE A.ID = B.ID

                               and b.id = 3

                           OFFSET 0);

 Seq Scan on t1 a  (cost=0.00..205199.00 rows=5000 width=25)

   Filter: (SubPlan 1)

   SubPlan 1

     ->  Result  (cost=0.00..20.50 rows=1 width=4)

           One-Time Filter: (a.id = 3)

           ->  Seq Scan on t2 b  (cost=0.00..20.50 rows=1 width=0)

                 Filter: (id = 3)


50% rule

First, note the OFFSET 0 clause in the subquery, which prevents the subquery from being collapsed into the outer query. The second thing I want to draw your attention to is the estimated row count of 5000: compare it with the actual number of rows in t1 and you can see that it is a 50% guess.
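
In the same style as the formula used for the 33% rule below, the arithmetic is:

10,000 * 1/2 = 5,000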


SELECT *

  FROM T1

 WHERE ID <= (SELECT ID

                from T2

               WHERE ID = 10) ;

 Index Scan using t1_x01 on t1  (cost=20.79..147.11 rows=3333 width=25)

   Index Cond: (id <= $0)

   InitPlan 1 (returns $0)

     ->  Seq Scan on t2  (cost=0.00..20.50 rows=1 width=4)

           Filter: (id = 10)


33% rule

When the <, >, <=, or >= operators compare a column with a value that is unknown at planning time (here, the result of the subquery), PostgreSQL cannot do any better, so it assumes that 33% of the rows will match.
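
For the single inequality above, that works out as:

10,000 * 1/3 = 3,333  (rows=3333 in the plan)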


SELECT *

  FROM T1

 WHERE ID <= (SELECT ID

                from T2

               WHERE ID = 10)

   AND COL2 < (SELECT ID::TEXT

                from T2

               WHERE ID = 10);

 Index Scan using t1_x01 on t1  (cost=41.29..175.95 rows=1111 width=25)

   Index Cond: (id <= $0)

   Filter: (col2 < $1)

   InitPlan 1 (returns $0)

     ->  Seq Scan on t2  (cost=0.00..20.50 rows=1 width=4)

           Filter: (id = 10)

   InitPlan 2 (returns $1)

     ->  Seq Scan on t2 t2_1  (cost=0.00..20.50 rows=1 width=32)

           Filter: (id = 10)


Note that in the above query both the "<=" and "<" operators are used against values unknown at planning time, so the estimated row count is calculated by the following formula.

10,000 * 1/3 * 1/3 = 1111


SELECT *

  FROM T1

 WHERE ID != (SELECT 100);

 Seq Scan on t1  (cost=0.01..199.01 rows=9999 width=25)

   Filter: (id <> $0)

   InitPlan 1 (returns $0)

     ->  Result  (cost=0.00..0.01 rows=1 width=4)


1-cardinality RULE
The estimated row count for the condition id = $0 is 1, so the estimated row count for the condition id <> $0 is calculated by the following formula.
10,000 - 1 = 9999
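
You can check the first half of that formula directly; I would expect the plan for the equality form to show rows=1 on the t1 scan:

EXPLAIN SELECT * FROM T1 WHERE ID = (SELECT 100);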

Wrap Up
As shown above, when PostgreSQL cannot derive a selectivity from its statistics, its row-count estimates fall back to heuristic rules.

   
