PostgreSQL 은 특별합니다.
단순한 데이터베이스가 아니라, 하나의 어플리케이션 플랫폼이기 때문입니다.

공지사항

뉴스 및 기사

	총 게시물 167건, 최근 0 건

Speeding Up COUNT(DISTINCT col)

글쓴이 : 모델광 날짜 : 2021-06-24 (목) 23:53 조회 : 2782

From time to time when I cannot find a solution to speed up a query, I google a solution. In most cases, someone has already posted the answer.

The tuning technique presented in this note was derived from the article below.

https://scidb.tistory.com/entry/COUNTDistinct-%EC%BB%AC%EB%9F%BC%EC%9D%98-%EC%84%B1%EB%8A%A5?category=122494

This note describes how to speed up COUNT(DISTINCT col) query.

Below is the SQL statement I am interested in followed by scripts to make a test table and its data.

SELECT PRODUCT_ID

,COUNT(DISTINCT EMPLOYEE_ID)

,SUM(QUANTITY)

,SUM(EUR_VALUE)

FROM SALES

GROUP BY PRODUCT_ID;

CREATE TABLE employees ( employee_id NUMERIC NOT NULL, first_name VARCHAR(1000) NOT NULL, last_name VARCHAR(1000) NOT NULL, date_of_birth DATE , phone_number VARCHAR(1000) NOT NULL, junk CHAR(1000) , CONSTRAINT employees_pk PRIMARY KEY (employee_id) );

CREATE FUNCTION random_string(minlen NUMERIC, maxlen NUMERIC) RETURNS VARCHAR(1000) AS $$ DECLARE rv VARCHAR(1000) := ''; i INTEGER := 0; len INTEGER := 0; BEGIN IF maxlen < 1 OR minlen < 1 OR maxlen < minlen THEN RETURN rv; END IF; len := floor(random()*(maxlen-minlen)) + minlen; FOR i IN 1..floor(len) LOOP rv := rv || chr(97+CAST(random() * 25 AS INTEGER)); END LOOP; RETURN rv; END; $$ LANGUAGE plpgsql;

INSERT INTO employees (employee_id, first_name, last_name, date_of_birth, phone_number, junk) SELECT GENERATE_SERIES , initcap(lower(random_string(2, 8))) , initcap(lower(random_string(2, 8))) , CURRENT_DATE - CAST(floor(random() * 365 * 10 + 40 * 365) AS NUMERIC) * INTERVAL '1 DAY' , CAST(floor(random() * 9000 + 1000) AS NUMERIC) , 'junk' FROM GENERATE_SERIES(1, 1000);

CREATE TABLE sales (

sale_id NUMERIC NOT NULL,

employee_id NUMERIC NOT NULL,

subsidiary_id NUMERIC NOT NULL,

sale_date DATE NOT NULL,

eur_value NUMERIC(17,2) NOT NULL,

product_id BIGINT NOT NULL,

quantity INTEGER NOT NULL,

CHANNEL VARCHAR(4) NOT NULL,

CONSTRAINT sales_pk

PRIMARY KEY (sale_id)

);

SELECT SETSEED(0);

INSERT INTO sales (sale_id

, subsidiary_id, employee_id

, sale_date, eur_value

, product_id, quantity

, CHANNEL)

SELECT row_number() OVER (), data.*

FROM (

SELECT gen % 100, e.employee_id

, (CURRENT_DATE - CAST(RANDOM()*3650 AS NUMERIC) * INTERVAL '1 DAY') sale_date

, CAST(RANDOM()*100000 AS NUMERIC)/100 eur_value

, CAST(RANDOM()*25 AS NUMERIC) + 1 product_id

, CAST(RANDOM()*5 AS NUMERIC) + 1 quantity

, CASE WHEN GEN % 2 = 0 THEN 'ONLI' ELSE 'OFFL' END

FROM employees e

, GENERATE_SERIES(1, 18000) gen

WHERE MOD(employee_id, 7) = 4

-- AND gen < employee_id / 2

ORDER BY sale_date

) data

WHERE TO_CHAR(sale_date, 'D') <> '1';

ANALYZE sales;

Here is the execution plan of the SQL statement we are investigating.

GroupAggregate (actual time=768.307..2014.173 rows=26 loops=1) Output: product_id, count(DISTINCT employee_id), sum(quantity), sum(eur_value) Group Key: sales.product_id Buffers: shared hit=16116 read=6633, temp read=22683 written=22751 -> Sort (actual time=750.573..955.085 rows=2206461 loops=1) Output: product_id, employee_id, quantity, eur_value Sort Key: sales.product_id Sort Method: external merge Disk: 79848kB Buffers: shared hit=16116 read=6633, temp read=20319 written=20375 -> Seq Scan on portal.sales (actual time=0.010..231.207 rows=2206461 loops=1) Output: product_id, employee_id, quantity, eur_value Buffers: shared hit=16115 read=6633 Planning Time: 0.069 ms Execution Time: 2021.849 ms

First, the optimizer scanned sales table to extract product_id, employee_id, quantity, eur_value. Then it sorted the result set by product_id. While sorting by product_id, it overran work_mem and spilled to disk(79848kB). Then GroupAggregate worked on sorted data. We cannot observe anything indicating how the optimizer processed the count(DISTINCT employee_id) part in the execution plan above. I can't explain how the optimizer processed count(DISTINCT employee_id). I just guess that the optimizer did a lot of work (for example, retrieving unique values of employee_id) to process the count(DISTINCT employee_id) part behind the scene, because I can observe huge elapsed time gap between Sort operation(955 ms) and GroupAggregate(2014 ms) operation. I can't explain why the optimizer doesn't use HashAggregate. Since we didn't require sorted data, the optimizer didn't have to use GroupAggregate. It would have been better to take advantage of HashAggregate. PostgreSQL doesn't provide any hints or tools with which the query writer forces the optimizer to use HashAggregate.

Now let's find a way to improve the performance of the SQL statement in investigation.

One of the principles tuners' have in mind is this:

"When there is a GROUP BY clause, try to minimize the data set to group by. When you group by huge data set, the DB server consumes a lot of CPU and memory."

Sales is a big table to group by in one go. So Let's make the data set to group by smaller by doing some re-engineering of the SQL statement.

SELECT PRODUCT_ID, COUNT(EMPLOYEE_ID), SUM(S_Q), SUM(E_V) FROM (

-- This code produces an intermediate result set. SELECT PRODUCT_ID, EMPLOYEE_ID

, SUM(QUANTITY) AS S_Q, SUM(EUR_VALUE) AS E_V FROM SALES GROUP BY PRODUCT_ID, EMPLOYEE_ID ) A GROUP BY PRODUCT_ID;

By producing an intermediate result set with an inline view, I relieved the load of CPU and had the work_mem less consumed. The following is its execution plan.

GroupAggregate (actual time=525.743..539.599 rows=26 loops=1)

Group Key: sales.product_id

Buffers: shared hit=15885 read=6891

-> Finalize GroupAggregate (actual time=525.323..539.162 rows=3718 loops=1)

Group Key: sales.product_id, sales.employee_id

Buffers: shared hit=15885 read=6891

-> Gather Merge (actual time=525.307..533.391 rows=11154 loops=1)

Workers Planned: 2

Workers Launched: 2

Buffers: shared hit=15885 read=6891

-> Sort (actual time=520.658..520.892 rows=3718 loops=3)

Sort Key: sales.product_id, sales.employee_id

Sort Method: quicksort Memory: 619kB

Buffers: shared hit=15885 read=6891

Worker 0: Sort Method: quicksort Memory: 619kB

Worker 1: Sort Method: quicksort Memory: 619kB

-> Partial HashAggregate (actual time=515.427..516.931 rows=3718 loops=3)

Group Key: sales.product_id, sales.employee_id

Batches: 1 Memory Usage: 1489kB

Buffers: shared hit=15857 read=6891

Worker 0: Batches: 1 Memory Usage: 1489kB

Worker 1: Batches: 1 Memory Usage: 1489kB

-> Parallel Seq Scan on sales (actual time=0.067..153.417 rows=735487 loops=

Buffers: shared hit=15857 read=6891

Planning:

Buffers: shared hit=45 read=1

Planning Time: 1.078 ms

Execution Time: 540.017 ms

We have to note that now HashAggregate kicks in. First, the optimizer grouped the intermediate result set by product_id, employee_id. Secondly, the optimizer sorted the result set by product_id, employee_id followed by GroupAggregate. Finally, GroupAggregate worked again without sorting operation because the intermediate result set was already sorted in the previous GroupAggrregate. When HashAggregate worked, the number of retrieved rows per parallel processes was just 3718. That's why Sort operation following didn't take long. And for some reason, the planner decided that the query can be parallelized.

By rewriting the query we could drop the elapsed time to about 70% of the elapsed time of the original statement and eliminate "spilling to disk".

You might think the revised SQL is a little messier and has poor readability, but you can't get anything without sacrifice.

모델광

2021-06-24 (목) 23:57

Wrap Up
1. Try to eliminate GroupAggregate. Mostly HashAggregate is faster.
2. When there is GROUP BY, try to make the data set to group by smaller.

Footnote
In the future PG version, I expect HashAggregate will be more frequently used. I will go into the details in other notes.

Addendum 1
Below is the execution plan produced and block I/Os when I forced the query planner not to use parallelism. I only adjusted a parallelism related parameter.

set max_parallel_workers_per_gahter=0;
--I ran the revised SQL statement.
GroupAggregate (actual time=865.406..865.958 rows=26 loops=1)
Group Key: sales.product_id
Buffers: shared hit=16063 read=6685
-> Sort (actual time=865.351..865.498 rows=3718 loops=1)
Sort Key: sales.product_id
Sort Method: quicksort Memory: 387kB
Buffers: shared hit=16063 read=6685
-> HashAggregate (actual time=863.870..864.609 rows=3718 loops=1)
Group Key: sales.product_id, sales.employee_id
Batches: 1 Memory Usage: 1489kB
Buffers: shared hit=16063 read=6685
-> Seq Scan on sales (actual time=0.102..228.240 rows=2206461 loops=1)
Buffers: shared hit=16063 read=6685
Planning:
Buffers: shared hit=30 read=3
Planning Time: 1.695 ms
Execution Time: 866.349 ms

Addendum 2
I provided the optimizer with the ndistinct extended statistics. But it didn't have any effect on the execution plan.

create statistics s(ndistinct) on product_id, employee_id from sales;
analyze sales;

I ran the query below to report the contents of the extended statistics.

select * from pg_stats_ext;

-[ RECORD 1 ]----------+-------------------------
schemaname | portal
tablename | sales
statistics_schemaname | portal
statistics_name | s
statistics_owner | scott
attnames | {employee_id,product_id}
kinds | {d}
n_distinct | {"2, 6": 3712}
dependencies |
most_common_vals |
most_common_val_nulls |
most_common_freqs |
most_common_base_freqs |

댓글주소

Wrap Up
1. Try to eliminate GroupAggregate. Mostly HashAggregate is faster.
2. When there is GROUP BY, try to make the data set to group by smaller.

Footnote
In the future PG version, I expect HashAggregate will be more frequently used. I will go into the details in other notes.

Addendum 1
Below is the execution plan produced and block I/Os when I forced the query planner not to use parallelism. I only adjusted a parallelism related parameter.

set max_parallel_workers_per_gahter=0;
--I ran the revised SQL statement.
 GroupAggregate (actual time=865.406..865.958 rows=26 loops=1)
   Group Key: sales.product_id
   Buffers: shared hit=16063 read=6685
   ->  Sort (actual time=865.351..865.498 rows=3718 loops=1)
         Sort Key: sales.product_id
         Sort Method: quicksort  Memory: 387kB
         Buffers: shared hit=16063 read=6685
         ->  HashAggregate (actual time=863.870..864.609 rows=3718 loops=1)
               Group Key: sales.product_id, sales.employee_id
               Batches: 1  Memory Usage: 1489kB
               Buffers: shared hit=16063 read=6685
               ->  Seq Scan on sales (actual time=0.102..228.240 rows=2206461 loops=1)
                     Buffers: shared hit=16063 read=6685
 Planning:
   Buffers: shared hit=30 read=3
 Planning Time: 1.695 ms
 Execution Time: 866.349 ms

Addendum 2
I provided the optimizer with the ndistinct extended statistics. But it didn't have any effect on the execution plan.

create statistics s(ndistinct) on product_id, employee_id from sales;
analyze sales;

I ran the query below to report the contents of the extended statistics.

select * from pg_stats_ext;

-[ RECORD 1 ]----------+-------------------------
schemaname             | portal
tablename              | sales
statistics_schemaname  | portal
statistics_name        | s
statistics_owner       | scott
attnames               | {employee_id,product_id}
kinds                  | {d}
n_distinct             | {"2, 6": 3712}
dependencies           |
most_common_vals       |
most_common_val_nulls  |
most_common_freqs      |
most_common_base_freqs |

PostgresDBA

2021-06-25 (금) 08:48

* 비밀글 입니다.

댓글주소

E-MAIL: root@postgresdba.com