Cardinality Estimation Rule

Author : 모델광   Date : 2021-07-06 (Tue) 21:05   Views : 1797

The shape of an execution plan is based on the estimated cardinality, and that cardinality is calculated from the statistics PostgreSQL keeps about the data.

The query planner produces different access paths, join methods, and join orders depending on the estimated cardinalities of the joining tables and on the available indexes and constraints.

If a query has a performance problem, a large gap between the estimated and actual row counts is often the first thing to investigate.
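
The easiest way to see that gap is EXPLAIN (ANALYZE), which prints the estimated and actual row counts side by side for every plan node. A minimal sketch (the table and column below are placeholders, not objects from this post):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM some_table WHERE some_col = 'A';   -- some_table/some_col are placeholders
-- each plan node then shows "(cost=... rows=<estimated> ...) (actual time=... rows=<actual> loops=...)"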


When can PostgreSQL estimate row counts accurately? Unfortunately, there are only a few cases where PostgreSQL's row-count estimates are reliable.

The following are those cases.


* Conditions like col1 = 'A' are typically well-handled.

* Similarly, conditions like col1 > 10 are usually well estimated, based on a combination of the most-common-value list and the histogram of non-MCV values. Planner error in this area usually happens because the table is changing quickly and auto-analyze falls behind.

* Conditions like col1 IS NULL are well-handled because the fraction of values that are null is one of the statistics PostgreSQL gathers (these statistics are visible in the pg_stats query shown below).
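
The per-column statistics behind all three cases can be inspected in the pg_stats view. A minimal sketch, again with placeholder table and column names:

SELECT null_frac, n_distinct, most_common_vals, most_common_freqs, histogram_bounds
  FROM pg_stats
 WHERE tablename = 'some_table' AND attname = 'col1';   -- placeholders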


For anything beyond the above cases, the PostgreSQL planner has trouble with row-count estimates. In fact, most DBMS optimizers have the same issue. As for Oracle, it uses what I call heuristic rules, such as the 1% rule and the 5% rule, among others.


I did some investigation into whether PostgreSQL has similar rules.

The following is the script I created to run some experiments.


drop table if exists t1;

drop table if exists t2;


create table t1

as

select i as id, rpad(chr(64+mod(i,26))::text,20,'x') as col2 from generate_series(1,10000) a(i);

create table t2 as select * from t1 order by id limit 1000;

analyze t1;

analyze t2;


Note that T1 has 10,000 rows and T2 has 1,000 rows.
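
One note on reproducing the plans below: some of them use an index named t1_x01, which is not created in the script above. Judging from the plans, I assume it is a plain index on t1(id), created with something like this:

create index t1_x01 on t1 (id);   -- assumed definition, based on the index conditions shown later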


SELECT *

  FROM T1

 WHERE ID BETWEEN (SELECT 1) AND (SELECT 10);

 Seq Scan on t1  (cost=0.02..224.02 rows=50 width=25)

   Filter: ((id >= $0) AND (id <= $1))

   InitPlan 1 (returns $0)

     ->  Result  (cost=0.00..0.01 rows=1 width=4)

   InitPlan 2 (returns $1)

     ->  Result  (cost=0.00..0.01 rows=1 width=4)


SELECT *

  FROM T1

 WHERE SUBSTR(COL2,10,2)='A';

 Seq Scan on t1  (cost=0.00..224.00 rows=50 width=25)

   Filter: (substr(col2, 10, 2) = 'A'::text)


SELECT *

  FROM T1

 WHERE id +1 = 2;


 Seq Scan on t1  (cost=0.00..224.00 rows=50 width=25)

   Filter: ((id + 1) = 2)


SELECT *

  FROM T1

 WHERE id::text = '1';


 Seq Scan on t1  (cost=0.00..249.00 rows=50 width=25)

   Filter: ((id)::text = '1'::text)


0.5% RULE

When the column in a predicate is manipulated (wrapped in a function, used in an expression, or cast to another type), or when the comparison values are not known at planning time (as in the BETWEEN example with subquery bounds), the planner cannot derive a selectivity from the column statistics and falls back to a default. In all of the above cases the estimated row count is 50, which is 0.5% of t1's 10,000 rows (10,000 * 0.005 = 50). I would call this the "0.5% RULE".
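
Incidentally, for the manipulated-column cases there is a way to give the planner real statistics: an expression index, whose expression values are analyzed just like a regular column. A sketch under that assumption (the index name below is mine):

create index t1_substr_x01 on t1 (substr(col2, 10, 2));   -- hypothetical index name
analyze t1;
-- After this, EXPLAIN SELECT * FROM T1 WHERE SUBSTR(COL2,10,2)='A'
-- should produce an estimate based on the expression's own statistics
-- rather than the 0.5% default.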


SELECT *

  FROM T1 A

 WHERE NOT EXISTS (SELECT 1

                            FROM T2 B

                           WHERE A.ID = B.ID

                               and b.id = 3

                           OFFSET 0);

 Seq Scan on t1 a  (cost=0.00..205199.00 rows=5000 width=25)

   Filter: (NOT (SubPlan 1))

   SubPlan 1

     ->  Result  (cost=0.00..20.50 rows=1 width=4)

           One-Time Filter: (a.id = 3)

           ->  Seq Scan on t2 b  (cost=0.00..20.50 rows=1 width=0)

                 Filter: (id = 3)


SELECT *

  FROM T1 A

 WHERE EXISTS (SELECT 1

                            FROM T2 B

                           WHERE A.ID = B.ID

                               and b.id = 3

                           OFFSET 0);

 Seq Scan on t1 a  (cost=0.00..205199.00 rows=5000 width=25)

   Filter: (SubPlan 1)

   SubPlan 1

     ->  Result  (cost=0.00..20.50 rows=1 width=4)

           One-Time Filter: (a.id = 3)

           ->  Seq Scan on t2 b  (cost=0.00..20.50 rows=1 width=0)

                 Filter: (id = 3)


50% rule

First, note the OFFSET 0 clause in the subquery, which prevents the subquery from being collapsed into the outer query. The second thing I want to draw your attention to is the estimated row count of 5000: compare it with the actual number of rows in t1 and you can see that it is a 50% guess.
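
In the same style as the formula used for the 33% rule below, the arithmetic is:

10,000 * 1/2 = 5,000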


SELECT *

  FROM T1

 WHERE ID <= (SELECT ID

                from T2

               WHERE ID = 10) ;

 Index Scan using t1_x01 on t1  (cost=20.79..147.11 rows=3333 width=25)

   Index Cond: (id <= $0)

   InitPlan 1 (returns $0)

     ->  Seq Scan on t2  (cost=0.00..20.50 rows=1 width=4)

           Filter: (id = 10)


33% rule

When the <, >, <=, or >= operators compare a column with a value that is unknown at planning time (here, the result of the subquery), PostgreSQL cannot do any better, so it assumes that 33% of the rows will match.
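
For the single inequality above, that works out as:

10,000 * 1/3 = 3,333  (rows=3333 in the plan)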


SELECT *

  FROM T1

 WHERE ID <= (SELECT ID

                from T2

               WHERE ID = 10)

   AND COL2 < (SELECT ID::TEXT

                from T2

               WHERE ID = 10);

 Index Scan using t1_x01 on t1  (cost=41.29..175.95 rows=1111 width=25)

   Index Cond: (id <= $0)

   Filter: (col2 < $1)

   InitPlan 1 (returns $0)

     ->  Seq Scan on t2  (cost=0.00..20.50 rows=1 width=4)

           Filter: (id = 10)

   InitPlan 2 (returns $1)

     ->  Seq Scan on t2 t2_1  (cost=0.00..20.50 rows=1 width=32)

           Filter: (id = 10)


Note that in the above query both the "<=" and "<" operators are used against values unknown at planning time, so the estimated row count is calculated by the following formula.

10,000 * 1/3 * 1/3 = 1111


SELECT *

  FROM T1

 WHERE ID != (SELECT 100);

 Seq Scan on t1  (cost=0.01..199.01 rows=9999 width=25)

   Filter: (id <> $0)

   InitPlan 1 (returns $0)

     ->  Result  (cost=0.00..0.01 rows=1 width=4)


1-cardinality RULE
The estimated row count for the condition id = $0 is 1, so the estimated row count for the condition id <> $0 is calculated by the following formula.
10,000 - 1 = 9999
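
You can check the first half of that formula directly; I would expect the plan for the equality form to show rows=1 on the t1 scan:

EXPLAIN SELECT * FROM T1 WHERE ID = (SELECT 100);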

Wrap Up
As shown above, when PostgreSQL cannot derive a selectivity from its statistics, its row-count estimates fall back to heuristic rules.

   
