설문조사
PostgreSQL/PPAS 관련 듣고 싶은 교육은


총 게시물 162건, 최근 0 건
   

Indexing Small Tables

글쓴이 : 모델광 날짜 : 2022-11-05 (토) 11:12 조회 : 865

Recently while browsing SQL Server documents I ran into absolute garbage advice. And the document did not present any evidence to prove its bull shit assertion.


Here is the URL of the document followed by its garbage advice.

https://learn.microsoft.com/en-us/sql/relational-databases/sql-server-index-design-guide?view=sql-server-ver16


* Indexing small tables may not be optimal because it can take the query optimizer longer to traverse the index searching for data than to perform a simple table scan. Therefore, indexes on small tables might never be used, but must still be maintained as data in the table changes.


In this note I will show you why the guideline in the URL above is rubbish or at least the guideline is a 95 percent lie. I understand you may think the guideline can be controversial, but please bear with me while I explain my point of view.


I will start with some data and simple queries running under PostgreSQL 14.


create table t1
as
select i as empno, repeat('a',10) c2
  from generate_series(1,150) a(i);


CREATE UNIQUE INDEX t1_X1 ON t1(EMPNO);


VACUUM t1;


SELECT relname, reltuples, relpages
  FROM PG_CLASS
 WHERE RELNAME='t1';

+---------+-----------+----------+
| relname | reltuples | relpages |
+---------+-----------+----------+
| t1      | 150.0     | 1        |
+---------+-----------+----------+


I have created and popluated a table called T1. The table is so small. It occupies one block. Surely there is no faster way of reading the whole records. But when we need to read just one row from one block table, will an index be unnecessary? Instead of doing a thought experiment, let's do the test.


I have decreased random_page_cost to 1.1 (if you use the default value, the planner does not access the index.)


set random_page_cost to 1.1;


Here is the query we're interested in:


SELECT *
  FROM T1
 WHERE EMPNO = 7;


Here is the execution plan (in PostgreSQL 14 with the planner cost parameter RANDOM_PAGE_COST set to 1.1).


Index Scan using t1_x1 on t1 (cost=0.14..2.16 rows=1 width=15) (actual time=0.017..0.018 rows=1 loops=1)
Index Cond: (empno = 7)
Buffers: shared hit=2
Planning Time: 0.082 ms
Execution Time: 0.037 ms


Note that the optimizer accessed two blocks of which 1 block comes from the index

 access and the other comes from the table access. For the purpose of testing, I have created a table called T2, which doesn't have an index:


CREATE TABLE T2 AS
SELECT * FROM T1;


VACUUM T2;


I have run a query to retrieve a row.


SELECT *
  FROM T2
 WHERE EMPNO = 7;


Let's examine the execution plan which is as follows:


Seq Scan on t2  (cost=0.00..2.88 rows=1 width=15) (actual time=0.012..0.012 rows=1 loops=1)
  Filter: (empno = 7)
  Rows Removed by Filter: 149
  Buffers: shared hit=1
Planning:
  Buffers: shared hit=7 dirtied=1
Planning Time: 0.278 ms
Execution Time: 0.037 ms


The plan is fairly simple. An important observation is that PostgreSQL accessed one table block and filtered out 149 rows. When you optimize a query, you have to focus on the number of block IOs, not timing. If your database is running on a cloud environment, more block IOs mean more money. Considering the number of block IOs no index access seems to be the winner. Critically, though, we should ask "what is the impact on the elapsed time?" With one time execution we cannot tell which access path is better in terms of the elapsed time.

Now let's do a performance test on those two tables to see which access path is faster, index access path or full table scan path.


analdb=# \timing

Timing is on.


DO $body$
DECLARE
  v1  int; v2 varchar;
BEGIN
 FOR i IN 1 .. 20000 LOOP
   SELECT empno, c2
     INTO v1, v2
     FROM T1               --index access
    WHERE EMPNO = 7;
 END LOOP;
END;
$body$;

Time: 50.538 ms


DO $body$
DECLARE
  v1  int; v2 varchar;
BEGIN
 FOR i IN 1 .. 20000 LOOP
   SELECT empno, C2
     INTO v1, v2
     FROM T2              --full table scan access
    WHERE EMPNO = 7;
 END LOOP;
END;
$body$;

Time: 192.457 ms


As you can see,we can get the job done with an index 4 times as fast. Performance wise it is better than no indexes there. Four times as fast actually means one fourth as much CPU. Having the index even for a one block table for index access lookup is going to be better.

No index access has less block IOs but it burns more CPU filtering out 149 rows.

This is just an edge case example and a table with just one block is rare. With more than two blocks in a table, the advantage of block IO sinks.


I have reached the end of this article. I hope that now you undestand indexes will help tiny tables.


Footnote

It is very difficult to think of a situation where an index on a small table harms performance. If there are frequent update operations on a tiny table, the cost of maintaining an index can be expensive. However, it is rare to have many update operations on a single block table. I argue we may restrict the PostgreSQL optimizer's options by deciding not to index a small table.


   

postgresdba.com