
Index Design sample2

Author: 모델광   Date: 2021-11-13 (Sat) 10:48   Views: 68

Here we can see Incremental Sort, a new feature in version 13. Thanks to this feature, the optimizer doesn't have to sort the entire (sale_date, quantity, channel, subsidiary_id) key; it only needs to sort the (quantity, channel, subsidiary_id) part within each group of the presorted leading column. So it can avoid a more demanding external sort. Hmm... It seems that we need to change the order of the columns on the index SALES_X03 to reduce the sorting load. Let's change the order of the columns on the index.


drop index SALES_X03;

CREATE INDEX SALES_X03 ON SALES (SALE_DATE, QUANTITY, CHANNEL, SUBSIDIARY_ID);


Limit (actual time=0.060..1.465 rows=100 loops=1)

   Buffers: shared hit=4 read=19

   ->  Group (actual time=0.059..1.451 rows=100 loops=1)

         Group Key: sale_date, quantity, channel, subsidiary_id

         Buffers: shared hit=4 read=19

         ->  Index Only Scan using sales_x03 on sales (actual time=0.056..1.210 rows=933 loops=1)

               Index Cond: (quantity <= 5)

               Filter: (subsidiary_id = ANY ('{10,40}'::numeric[]))

               Rows Removed by Filter: 2207

               Heap Fetches: 91

               Buffers: shared hit=4 read=19

 Planning:

   Buffers: shared hit=27 read=1

 Planning Time: 0.432 ms

 Execution Time: 1.503 ms


Here we can't see any sorting operation. The total block I/O decreased from (4+25) to (4+19), and the elapsed time decreased from 1.854 ms to 1.503 ms. You may think that this is it and we can now ship this index to the production system. However, we have to deal with the possibility of index bloat.

The disadvantage of this index (SALE_DATE, QUANTITY, CHANNEL, SUBSIDIARY_ID) is:

- We have indexed every row of the 2.2M-row table with a 4-column index, and those columns are likely to change frequently. Moreover, even when only non-indexed columns are updated, new entries usually have to be inserted into this index as well, because an UPDATE in PostgreSQL creates a new row version. This is a feature of the PostgreSQL architecture; I will not go into its details in this note.
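
Before shipping such a wide index, it is worth checking how much disk space it takes compared with the table itself. A quick check using the standard catalog functions pg_relation_size and pg_size_pretty:

SELECT pg_size_pretty(pg_relation_size('sales_x03')) AS index_size,
       pg_size_pretty(pg_relation_size('sales'))     AS table_size;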


(Thinking Phase 3)

So we should build indexes on fewer columns whenever possible. Furthermore, we have to find a way to reduce the size of the index.

So now let's make an index on the column SALE_DATE only.


CREATE INDEX SALES_X04 ON SALES(SALE_DATE);

 Limit (actual time=0.597..1.811 rows=100 loops=1)

   Buffers: shared hit=39 read=6

   ->  Group (actual time=0.595..1.804 rows=100 loops=1)

         Group Key: sale_date, quantity, channel, subsidiary_id

         Buffers: shared hit=39 read=6

         ->  Incremental Sort (actual time=0.594..1.683 rows=933 loops=1)

               Sort Key: sale_date, quantity, channel, subsidiary_id

               Presorted Key: sale_date

               Full-sort Groups: 5  Sort Method: quicksort  Average Memory: 30kB  Peak Memory: 30kB

               Pre-sorted Groups: 5  Sort Method: quicksort  Average Memory: 39kB  Peak Memory: 39kB

               Buffers: shared hit=39 read=6

               ->  Index Scan using sales_x04 on sales (actual time=0.161..1.062 rows=942 loops=1)

                     Filter: ((subsidiary_id = ANY ('{10,40}'::numeric[])) AND (quantity <= 5))

                     Rows Removed by Filter: 2573

                     Buffers: shared hit=39 read=6

 Planning:

   Buffers: shared hit=18 read=1

 Planning Time: 1.560 ms

 Execution Time: 2.049 ms


People from an Oracle background may not understand this execution plan. Even though there is no SALE_DATE column in the WHERE clause, the optimizer accessed the SALE_DATE index, which is not possible in Oracle. And the PostgreSQL 13 feature, Incremental Sort, kicked in. With this particular data distribution, the index on SALE_DATE worked well. But if the WHERE clause filters out most of the rows, this index can be a disaster, because we would have to scan a much broader range of the index. Let's say that few rows survive the (subsidiary_id, quantity) filter, and we decided not to create an index on (subsidiary_id, quantity) for some reason. Then how can we improve the performance of the query using the index on SALE_DATE?


(Thinking Phase 4)

Let's suppose that:

- in the WHERE SUBSIDIARY_ID IN (10,40) AND QUANTITY <= 5 clause, the quantity value '5' is a fixed value that does not vary between executions.

- (subsidiary_id, quantity) filters out many rows, but we cannot create an index on those columns for some reason. For example, a (subsidiary_id, quantity) index might degrade the performance of another query.

Now we may use a partial index (often called a filtered index in other DBMSs).


CREATE INDEX SALES_X05 ON SALES(SALE_DATE)

WHERE QUANTITY <= 5 AND SUBSIDIARY_ID IN (10,40);


모델광   2021-11-13 (Sat) 10:52
Limit (actual time=0.428..1.184 rows=100 loops=1)
  Buffers: shared hit=39 read=3
  ->  Group (actual time=0.427..1.178 rows=100 loops=1)
        Group Key: sale_date, quantity, channel, subsidiary_id
        Buffers: shared hit=39 read=3
        ->  Incremental Sort (actual time=0.426..1.003 rows=933 loops=1)
              Sort Key: sale_date, quantity, channel, subsidiary_id
              Presorted Key: sale_date
              Full-sort Groups: 5  Sort Method: quicksort  Average Memory: 30kB  Peak Memory: 30kB
              Pre-sorted Groups: 5  Sort Method: quicksort  Average Memory: 39kB  Peak Memory: 39kB
              Buffers: shared hit=39 read=3
              ->  Index Scan using sales_x05 on sales (actual time=0.256..0.498 rows=942 loops=1)
                    Buffers: shared hit=39 read=3
 Planning:
  Buffers: shared hit=19 read=1
 Planning Time: 0.882 ms
 Execution Time: 1.225 ms

The execution plan is somewhat tricky. If someone else had created the index, we would be at a loss about the plan above, because we cannot see any filtering operation on the QUANTITY column. We would have to query the pg_indexes view like this:

SELECT tablename, indexdef FROM PG_INDEXES WHERE TABLENAME='sales';

By looking at the indexdef column, we can get the DDL that created the index.
Anyway, we can say that in accessing sales_x05 the optimizer had already processed the predicates, because they are baked into the partial index definition.

(Thinking Phase 5)
One of the options we have when we cannot create an index is partitioning. Partitioning is a way of splitting a large table into smaller pieces. To improve the performance of this query, we can use SUBSIDIARY_ID as the partition key. Then the query doesn't have to scan the whole sales table; it just needs to scan a few partitions, which decreases block I/O.
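
As a rough sketch, assuming we could rebuild sales as a LIST-partitioned table on SUBSIDIARY_ID (the column types and partition names below are illustrative, not taken from the actual DDL):

CREATE TABLE sales_part (
    sale_date     date,
    quantity      numeric,
    channel       varchar(10),
    subsidiary_id numeric
) PARTITION BY LIST (subsidiary_id);

CREATE TABLE sales_part_p10_40 PARTITION OF sales_part FOR VALUES IN (10, 40);
CREATE TABLE sales_part_others PARTITION OF sales_part DEFAULT;

With this layout, the predicate SUBSIDIARY_ID IN (10,40) lets the planner prune every partition except sales_part_p10_40.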

(Thinking Phase 6)
If there are no updates, inserts, or deletes on the sales table, we may consider using a block range index (BRIN). A BRIN index is incredibly helpful for searching over large data sets efficiently and has the benefit of taking up significantly less disk space than a B-tree index. The benefit of BRIN indexes grows as the base table grows to a significant size.
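
For example, a BRIN index on SALE_DATE might look like this (note that BRIN only pays off when the rows are physically stored roughly in SALE_DATE order, e.g. when the data is loaded chronologically):

CREATE INDEX SALES_BRIN ON SALES USING brin (SALE_DATE);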

Footnote 1
When there are a lot of update transactions on a table, we need to optimize some storage parameters. We should set the fillfactor on the sales table to a smaller value, such as 90. For a table whose entries are never updated, the default value of 100 is the best choice, but for heavily updated tables smaller fillfactors are appropriate. We can change the fillfactor with SQL like this:

ALTER TABLE sales SET (fillfactor=90);

You can check the value for a specific table or index using the script below.

SELECT pc.relname AS ObjectName,pc.reloptions AS ObjectOptions
FROM pg_class AS pc INNER JOIN pg_namespace AS pns
  ON pns.oid = pc.relnamespace
WHERE pns.nspname = 'public'
  AND pc.relname = 'sales';

On the other hand, the default fillfactor for a B-tree index is 90.

On top of the fillfactor, we also need to watch for dead tuples in the table. We can monitor them by issuing the following command periodically.

SELECT schemaname, relname, n_live_tup, n_dead_tup, last_auto_vacuum
 FROM pg_stat_user_tables
WHERE relname='sales';

If the ratio of n_dead_tup to n_live_tup is high, you need to set some autovacuum parameters more aggressively.
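
For example, we can lower the autovacuum trigger thresholds for this one table (the values below are illustrative, not recommendations):

ALTER TABLE sales SET (autovacuum_vacuum_scale_factor = 0.05,
                       autovacuum_vacuum_threshold = 1000);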

Footnote 2
Even though we are not much concerned about the extra disk space needed for indexes, because disks are cheap, we do not want to create useless database objects. You can check the following catalog view to monitor whether the indexes you created are actually being used.

SELECT * FROM pg_stat_all_indexes;
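
In practice it is more convenient to focus on indexes that have never been scanned. A sketch using the idx_scan counter, restricted to the public schema here for illustration:

SELECT schemaname, relname, indexrelname, idx_scan
  FROM pg_stat_all_indexes
 WHERE schemaname = 'public'
   AND idx_scan = 0;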

Note that some primary key indexes are never used for data retrieval; however, they are vital for data integrity and should not be removed.

Footnote 3
It is possible that the condition for a partial index was very restrictive and index access was very fast at first, but later, as more and more records came to satisfy the condition, the index became less efficient.
Although we strive to make sure that a query will perform well even as data volumes grow, we cannot assume that anything stays optimized forever. We should always keep an eye on data volumes, value distributions and other characteristics that can interfere with query performance.
 

postgresdba.com