
Index Design sample1

Author : 모델광  Date : 2021-11-13 (Sat) 10:42  Views : 67

This note is divided into two posts due to the volume limit of this bulletin board. You can find the latter part of this note in the next post on this board.


It is challenging to provide general recommendations regarding which indexes are necessary. I frequently use partial and covering indexes whenever it makes sense. Partial indexes (filtered indexes in SQL Server) are usually smaller than regular indexes and more likely to fit in shared buffers. Covering indexes save trips to the table, thus allowing the engine to perform most processing in shared buffers.
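For illustration, here is what those two index types look like in PostgreSQL syntax. The table and column names below are hypothetical, not from the case study that follows:

```sql
-- Partial index: only rows matching the WHERE clause are indexed,
-- so the index stays small and is more likely to stay cached.
CREATE INDEX orders_unshipped_x01
    ON orders (customer_id)
 WHERE shipped = false;

-- Covering index: INCLUDE (PostgreSQL 11+) stores extra payload
-- columns in the leaf pages, enabling index-only scans that
-- avoid visiting the table heap.
CREATE INDEX orders_cust_x02
    ON orders (customer_id, order_date)
    INCLUDE (amount);
```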


The following is a case study on how to design an efficient index.

What I'm aiming to present in this note is the pattern of thinking you should adopt in designing an index.

Here is the query to investigate. You can get the script to create the SALES table in the comment section below.




SELECT QUANTITY, SALE_DATE, CHANNEL, SUBSIDIARY_ID
  FROM SALES
 WHERE SUBSIDIARY_ID IN (10,40)
   AND QUANTITY <= 5
 GROUP BY QUANTITY, SALE_DATE, CHANNEL, SUBSIDIARY_ID
 ORDER BY SALE_DATE
 FETCH NEXT 100 ROWS ONLY;

--The execution plan

 Limit (actual time=598.256..598.265 rows=100 loops=1)
   Buffers: shared hit=13912 read=8837, temp read=1578 written=3290
   ->  Sort (actual time=598.254..598.257 rows=100 loops=1)
         Sort Key: sale_date
         Sort Method: top-N heapsort  Memory: 38kB
         Buffers: shared hit=13912 read=8837, temp read=1578 written=3290
         ->  HashAggregate (actual time=525.954..593.057 rows=62354 loops=1)
               Group Key: sale_date, quantity, channel, subsidiary_id
               Batches: 9  Memory Usage: 4177kB  Disk Usage: 15440kB
               Buffers: shared hit=13912 read=8837, temp read=1578 written=3290
               ->  Seq Scan on sales (actual time=0.019..356.429 rows=595696 loops=1)
                     Filter: ((subsidiary_id = ANY ('{10,40}'::numeric[])) AND (quantity <= 5))
                     Rows Removed by Filter: 1610765
                     Buffers: shared hit=13911 read=8837
 Planning:
   Buffers: shared hit=2
 Planning Time: 0.160 ms
 Execution Time: 601.082 ms


Our task is to design an index that improves the performance of the SQL above.

Here is the hypothetical situation we have.

1. The SQL runs every second.

2. The data in the sales table is updated frequently.

Let's assume that a row is modified every 10 seconds.


The restrictions above mean we cannot create many indexes: every additional index slows down the frequent updates and may result in table/index bloat. We are in a dilemma.

Let's start thinking about the query. It's using the FETCH NEXT 100 ROWS ONLY syntax, which means we may have to find a lot of data and sort it before returning a subset.
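As an aside, FETCH NEXT n ROWS ONLY is the SQL-standard syntax; PostgreSQL treats it the same as the more familiar LIMIT:

```sql
-- SQL-standard form
SELECT sale_date FROM sales ORDER BY sale_date FETCH NEXT 100 ROWS ONLY;

-- Equivalent PostgreSQL form
SELECT sale_date FROM sales ORDER BY sale_date LIMIT 100;
```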

In this case we're after the first 100 rows, which suggests there may be a lot of data satisfying the query. So we have two targets to meet when optimizing the query:

* acquire the data satisfying the where clause as efficiently as possible

* post-process the data we acquire to extract the 100 rows as soon as possible

We will either do a full table scan or find a good indexed access path. So the first thing to consider is the volume of data that matches the predicates.

(Thinking Phase 1)

If there is only a "small" amount of data that matches the where clause, then an index on (subsidiary_id, quantity) may be very helpful. Let's check the cardinality.


select relname, reltuples from pg_class where relname='sales';

 relname | reltuples
---------+-----------
 sales   |   2206461

select count(*) 

  from sales

 WHERE SUBSIDIARY_ID IN (10,40)

   AND QUANTITY <= 5;

 count

--------

 595696


The total number of rows is 2.2 million and the number of rows satisfying the predicates is 595696, about 27% of the table, which indicates that an index on (subsidiary_id, quantity) is a bad idea.

This, though, raises several questions that need to be answered:

1. How small is "small"? In the context of 2.2 million rows, 595696 doesn't seem small. But what fraction of the 2.2 million rows should be considered small?

2. How many rows have SUBSIDIARY_ID 10, and how many have SUBSIDIARY_ID 40? The number of rows may differ significantly depending on the value of SUBSIDIARY_ID or QUANTITY.


You may have to change the where clause if the cardinalities are completely different depending on the value of SUBSIDIARY_ID or QUANTITY.
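To answer question 2, we can check the distribution ourselves. Here is a simple sketch; the pg_stats view also exposes this through the optimizer's own statistics:

```sql
-- Row counts per subsidiary_id within the predicate
SELECT subsidiary_id, count(*)
  FROM sales
 WHERE subsidiary_id IN (10, 40)
   AND quantity <= 5
 GROUP BY subsidiary_id;

-- Or inspect the statistics the planner uses
SELECT most_common_vals, most_common_freqs
  FROM pg_stats
 WHERE tablename = 'sales'
   AND attname  = 'subsidiary_id';
```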

Below is the execution plan you get when you create an index on (subsidiary_id, quantity).

CREATE INDEX SALES_X02 ON SALES (SUBSIDIARY_ID, QUANTITY);

 Limit (actual time=870.536..870.546 rows=100 loops=1)
   Buffers: shared hit=4579 read=20458, temp read=1578 written=3290
   ->  Sort (actual time=870.534..870.539 rows=100 loops=1)
         Sort Key: sale_date
         Sort Method: top-N heapsort  Memory: 38kB
         Buffers: shared hit=4579 read=20458, temp read=1578 written=3290
         ->  HashAggregate (actual time=791.284..864.958 rows=62354 loops=1)
               Group Key: sale_date, quantity, channel, subsidiary_id
               Batches: 9  Memory Usage: 4177kB  Disk Usage: 15440kB
               Buffers: shared hit=4579 read=20458, temp read=1578 written=3290
               ->  Bitmap Heap Scan on sales (actual time=73.497..545.356 rows=595696 loops=1)
                     Recheck Cond: ((subsidiary_id = ANY ('{10,40}'::numeric[])) AND (quantity <= 5))
                     Heap Blocks: exact=22748
                     Buffers: shared hit=4578 read=20458
                     ->  Bitmap Index Scan on sales_x02 (actual time=71.013..71.013 rows=595696 loops=1)
                           Index Cond: ((subsidiary_id = ANY ('{10,40}'::numeric[])) AND (quantity <= 5))
                           Buffers: shared hit=1 read=2287
 Planning:
   Buffers: shared hit=7
 Planning Time: 0.320 ms
 Execution Time: 873.815 ms


As expected, the performance got worse. The optimizer extracted 595696 rows satisfying the predicates and aggregated them into 62354 rows. It then sorted the aggregated result by sale_date. Most of the time was spent accessing the table heap via the sales_x02 index: 545.356 - 71.013 = about 474 ms.

The total block I/O (25037 shared blocks vs. 22749) and elapsed time (874 ms vs. 601 ms) are worse than those of the full table scan.


(Thinking Phase 2)

In order to proceed, let's say we concluded that the index on (subsidiary_id, quantity) is inefficient. Then we have to design another index. This time let's try to eliminate both the table heap access and the sorting operation by creating an index on (SALE_DATE, SUBSIDIARY_ID, QUANTITY, CHANNEL). Note that SALE_DATE must be the leading column to eliminate the sorting operation.


CREATE INDEX SALES_X03 ON SALES (SALE_DATE, SUBSIDIARY_ID, QUANTITY, CHANNEL);

 Limit (actual time=0.439..1.830 rows=100 loops=1)
   Buffers: shared hit=4 read=25
   ->  Group (actual time=0.438..1.823 rows=100 loops=1)
         Group Key: sale_date, quantity, channel, subsidiary_id
         Buffers: shared hit=4 read=25
         ->  Incremental Sort (actual time=0.436..1.698 rows=933 loops=1)
               Sort Key: sale_date, quantity, channel, subsidiary_id
               Presorted Key: sale_date
               Full-sort Groups: 5  Sort Method: quicksort  Average Memory: 30kB  Peak Memory: 30kB
               Pre-sorted Groups: 5  Sort Method: quicksort  Average Memory: 39kB  Peak Memory: 39kB
               Buffers: shared hit=4 read=25
               ->  Index Only Scan using sales_x03 on sales (actual time=0.076..1.168 rows=942 loops=1)
                     Index Cond: (quantity <= 5)
                     Filter: (subsidiary_id = ANY ('{10,40}'::numeric[]))
                     Rows Removed by Filter: 2235
                     Heap Fetches: 91
                     Buffers: shared hit=4 read=25
 Planning:
   Buffers: shared hit=30 read=1
 Planning Time: 0.248 ms
 Execution Time: 1.854 ms

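Note the "Heap Fetches: 91" in the plan above: an index-only scan must still visit the heap for any page not marked all-visible in the visibility map, so on a frequently updated table like this one, regular vacuuming is what keeps the index-only scan cheap. A quick way to check, assuming the sales table from this case study:

```sql
-- How many pages are marked all-visible vs. total pages
SELECT relallvisible, relpages
  FROM pg_class
 WHERE relname = 'sales';

-- Vacuuming refreshes the visibility map, reducing heap fetches
VACUUM sales;
```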


모델광  2021-11-13 (Sat) 10:45
The following is the script to create the SALES table.
CREATE TABLE sales (
  sale_id        NUMERIC NOT NULL,
  employee_id    NUMERIC NOT NULL,
  subsidiary_id  NUMERIC NOT NULL,
  sale_date      DATE NOT NULL,
  eur_value      NUMERIC(17,2) NOT NULL,
  product_id     BIGINT NOT NULL,
  quantity       INTEGER NOT NULL,
  channel        VARCHAR(4) NOT NULL,
  CONSTRAINT sales_pk
    PRIMARY KEY (sale_id),
  CONSTRAINT sales_emp_fk FOREIGN KEY (subsidiary_id, employee_id)
    REFERENCES employees(subsidiary_id, employee_id)
);

SELECT SETSEED(0);

INSERT INTO sales (sale_id
, subsidiary_id, employee_id
, sale_date, eur_value
, product_id, quantity
, CHANNEL)
SELECT row_number() OVER (), data.*
FROM (
SELECT e.subsidiary_id, e.employee_id
, (CURRENT_DATE - CAST(RANDOM()*3650 AS NUMERIC) * INTERVAL '1 DAY') sale_date
, CAST(RANDOM()*100000 AS NUMERIC)/100 eur_value
, CAST(RANDOM()*25 AS NUMERIC) + 1 product_id
, CAST(RANDOM()*5 AS NUMERIC) + 1 quantity
, CASE WHEN GEN % 2 = 0 THEN 'ONLI' ELSE 'OFFL' END
FROM employees e
, GENERATE_SERIES(1, 18000) gen
WHERE MOD(employee_id, 7) = 4
-- AND gen < employee_id / 2
ORDER BY sale_date
) data
WHERE TO_CHAR(sale_date, 'D') <> '1';
VACUUM ANALYZE sales;
   

postgresdba.com