From time to time, when I cannot work out how to speed up a query on my own, I google for a solution. In most cases, someone has already posted the answer.
The tuning technique presented in this note was derived from the article below.
https://scidb.tistory.com/entry/COUNTDistinct-%EC%BB%AC%EB%9F%BC%EC%9D%98-%EC%84%B1%EB%8A%A5?category=122494
This note describes how to speed up a COUNT(DISTINCT col) query.
Below is the SQL statement I am interested in, followed by the scripts that create the test tables and load their data.
SELECT PRODUCT_ID
,COUNT(DISTINCT EMPLOYEE_ID)
,SUM(QUANTITY)
,SUM(EUR_VALUE)
FROM SALES
GROUP BY PRODUCT_ID;
CREATE TABLE employees (
employee_id   NUMERIC       NOT NULL,
first_name    VARCHAR(1000) NOT NULL,
last_name     VARCHAR(1000) NOT NULL,
date_of_birth DATE,
phone_number  VARCHAR(1000) NOT NULL,
junk          CHAR(1000),
CONSTRAINT employees_pk
PRIMARY KEY (employee_id)
);
CREATE FUNCTION random_string(minlen NUMERIC, maxlen NUMERIC)
RETURNS VARCHAR(1000) AS $$
DECLARE
  rv  VARCHAR(1000) := '';
  i   INTEGER := 0;
  len INTEGER := 0;
BEGIN
  IF maxlen < 1 OR minlen < 1 OR maxlen < minlen THEN
    RETURN rv;
  END IF;
  len := floor(random()*(maxlen-minlen)) + minlen;
  FOR i IN 1..floor(len) LOOP
    rv := rv || chr(97+CAST(random() * 25 AS INTEGER));
  END LOOP;
  RETURN rv;
END;
$$ LANGUAGE plpgsql;
INSERT INTO employees (employee_id, first_name, last_name, date_of_birth, phone_number, junk)
SELECT GENERATE_SERIES
, initcap(lower(random_string(2, 8)))
, initcap(lower(random_string(2, 8)))
, CURRENT_DATE - CAST(floor(random() * 365 * 10 + 40 * 365) AS NUMERIC) * INTERVAL '1 DAY'
, CAST(floor(random() * 9000 + 1000) AS NUMERIC)
, 'junk'
FROM GENERATE_SERIES(1, 1000);
CREATE TABLE sales (
sale_id NUMERIC NOT NULL,
employee_id NUMERIC NOT NULL,
subsidiary_id NUMERIC NOT NULL,
sale_date DATE NOT NULL,
eur_value NUMERIC(17,2) NOT NULL,
product_id BIGINT NOT NULL,
quantity INTEGER NOT NULL,
CHANNEL VARCHAR(4) NOT NULL,
CONSTRAINT sales_pk
PRIMARY KEY (sale_id)
);
SELECT SETSEED(0);
INSERT INTO sales (sale_id
, subsidiary_id, employee_id
, sale_date, eur_value
, product_id, quantity
, CHANNEL)
SELECT row_number() OVER (), data.*
FROM (
SELECT gen % 100, e.employee_id
, (CURRENT_DATE - CAST(RANDOM()*3650 AS NUMERIC) * INTERVAL '1 DAY') sale_date
, CAST(RANDOM()*100000 AS NUMERIC)/100 eur_value
, CAST(RANDOM()*25 AS NUMERIC) + 1 product_id
, CAST(RANDOM()*5 AS NUMERIC) + 1 quantity
, CASE WHEN GEN % 2 = 0 THEN 'ONLI' ELSE 'OFFL' END
FROM employees e
, GENERATE_SERIES(1, 18000) gen
WHERE MOD(employee_id, 7) = 4
-- AND gen < employee_id / 2
ORDER BY sale_date
) data
WHERE TO_CHAR(sale_date, 'D') <> '1';
ANALYZE sales;
Here is the execution plan of the SQL statement we are investigating.
GroupAggregate (actual time=768.307..2014.173 rows=26 loops=1)
Output: product_id, count(DISTINCT employee_id), sum(quantity), sum(eur_value)
Group Key: sales.product_id
Buffers: shared hit=16116 read=6633, temp read=22683 written=22751
-> Sort (actual time=750.573..955.085 rows=2206461 loops=1)
Output: product_id, employee_id, quantity, eur_value
Sort Key: sales.product_id
Sort Method: external merge  Disk: 79848kB
Buffers: shared hit=16116 read=6633, temp read=20319 written=20375
-> Seq Scan on portal.sales (actual time=0.010..231.207 rows=2206461 loops=1)
Output: product_id, employee_id, quantity, eur_value
Buffers: shared hit=16115 read=6633
Planning Time: 0.069 ms
Execution Time: 2021.849 ms
First, the executor scanned the sales table to extract product_id, employee_id, quantity and eur_value. Then it sorted the result set by product_id; while sorting, it overran work_mem and spilled to disk (79848kB). Then GroupAggregate worked on the sorted data. Nothing in the plan shows how the count(DISTINCT employee_id) part was processed, but the huge gap in elapsed time between the Sort operation (955 ms) and the GroupAggregate operation (2014 ms) suggests that a lot of work, such as deduplicating the employee_id values within each group, went on behind the scenes. This also answers why the optimizer did not use HashAggregate: as far as I can tell, the PostgreSQL planner simply does not consider HashAggregate when a query contains a DISTINCT aggregate, because DISTINCT aggregates are evaluated on sorted input. Since we did not require sorted output, HashAggregate would otherwise have been a good fit, but vanilla PostgreSQL provides no hint with which the query writer can force the optimizer to use it.
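The kind of work a sort-based GroupAggregate does for a DISTINCT aggregate can be sketched in Python. This is a simplified model of the idea, not PostgreSQL's actual executor: rows arrive sorted by the group key, and for each group the aggregate must additionally deduplicate employee_id values before counting them.

```python
from itertools import groupby
from operator import itemgetter

def group_aggregate(rows):
    """Model of a sort-based GroupAggregate computing
    COUNT(DISTINCT employee_id), SUM(quantity), SUM(eur_value).
    rows: (product_id, employee_id, quantity, eur_value) tuples."""
    rows = sorted(rows, key=itemgetter(0))   # the Sort node (spills to disk when large)
    result = []
    for product_id, grp in groupby(rows, key=itemgetter(0)):
        seen = set()                         # per-group dedup: the hidden extra work
        quantity, value = 0, 0.0
        for _, emp, q, v in grp:
            seen.add(emp)
            quantity += q
            value += v
        result.append((product_id, len(seen), quantity, value))
    return result

rows = [(1, 10, 2, 5.0), (1, 10, 1, 2.5), (1, 11, 3, 1.0), (2, 10, 4, 8.0)]
print(group_aggregate(rows))   # [(1, 2, 6, 8.5), (2, 1, 4, 8.0)]
```

The per-group set maintenance is exactly the work the plan never itemizes, which is consistent with the unexplained time spent inside the GroupAggregate node.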
Now let's find a way to improve the performance of the SQL statement under investigation.
One of the principles tuners have in mind is this:
"When there is a GROUP BY clause, try to minimize the data set to group. When you group a huge data set, the DB server consumes a lot of CPU and memory."
Sales is a big table to group in one go. So let's shrink the data set to group by re-engineering the SQL statement.
SELECT PRODUCT_ID, COUNT(EMPLOYEE_ID), SUM(S_Q), SUM(E_V)
FROM (
-- This code produces an intermediate result set.
SELECT PRODUCT_ID, EMPLOYEE_ID
, SUM(QUANTITY) AS S_Q, SUM(EUR_VALUE) AS E_V
FROM SALES
GROUP BY PRODUCT_ID, EMPLOYEE_ID ) A
GROUP BY PRODUCT_ID;
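Before looking at the new plan, it is worth convincing ourselves that the rewrite is equivalent: the inner query emits exactly one row per (product_id, employee_id), so the outer COUNT(employee_id) equals COUNT(DISTINCT employee_id), and summing the subtotals equals summing the raw rows. A quick sanity check, using SQLite's in-memory database as a stand-in for PostgreSQL on toy data:

```python
import sqlite3

# Toy sales table; types simplified relative to the PostgreSQL DDL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product_id INT, employee_id INT, quantity INT, eur_value REAL);
INSERT INTO sales VALUES
  (1, 10, 2, 5.0), (1, 10, 1, 2.5), (1, 11, 3, 1.0),
  (2, 10, 4, 8.0), (2, 12, 1, 3.0), (2, 12, 2, 4.0);
""")

original = conn.execute("""
    SELECT product_id, COUNT(DISTINCT employee_id), SUM(quantity), SUM(eur_value)
    FROM sales GROUP BY product_id ORDER BY product_id
""").fetchall()

rewritten = conn.execute("""
    SELECT product_id, COUNT(employee_id), SUM(s_q), SUM(e_v)
    FROM (SELECT product_id, employee_id,
                 SUM(quantity) AS s_q, SUM(eur_value) AS e_v
          FROM sales GROUP BY product_id, employee_id) a
    GROUP BY product_id ORDER BY product_id
""").fetchall()

assert original == rewritten
print(original)   # [(1, 2, 6, 8.5), (2, 2, 7, 15.0)]
```

The equivalence holds here because employee_id is declared NOT NULL; with NULLs in play one would have to check the COUNT semantics more carefully.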
By producing an intermediate result set with an inline view, I relieved the CPU load and reduced work_mem consumption. The following is its execution plan.
GroupAggregate (actual time=525.743..539.599 rows=26 loops=1)
Group Key: sales.product_id
Buffers: shared hit=15885 read=6891
-> Finalize GroupAggregate (actual time=525.323..539.162 rows=3718 loops=1)
Group Key: sales.product_id, sales.employee_id
Buffers: shared hit=15885 read=6891
-> Gather Merge (actual time=525.307..533.391 rows=11154 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=15885 read=6891
-> Sort (actual time=520.658..520.892 rows=3718 loops=3)
Sort Key: sales.product_id, sales.employee_id
Sort Method: quicksort Memory: 619kB
Buffers: shared hit=15885 read=6891
Worker 0: Sort Method: quicksort Memory: 619kB
Worker 1: Sort Method: quicksort Memory: 619kB
-> Partial HashAggregate (actual time=515.427..516.931 rows=3718 loops=3)
Group Key: sales.product_id, sales.employee_id
Batches: 1 Memory Usage: 1489kB
Buffers: shared hit=15857 read=6891
Worker 0: Batches: 1 Memory Usage: 1489kB
Worker 1: Batches: 1 Memory Usage: 1489kB
-> Parallel Seq Scan on sales (actual time=0.067..153.417 rows=735487 loops=
Buffers: shared hit=15857 read=6891
Planning:
Buffers: shared hit=45 read=1
Planning Time: 1.078 ms
Execution Time: 540.017 ms
We have to note that HashAggregate now kicks in. First, each parallel worker ran a Partial HashAggregate, grouping its share of the rows by product_id, employee_id. Each worker then sorted its partial result by product_id, employee_id, and Gather Merge combined the workers' sorted streams. Next, the Finalize GroupAggregate completed the grouping by product_id, employee_id. Finally, the outer GroupAggregate grouped by product_id without a Sort of its own, because its input was already sorted on product_id. When HashAggregate ran, each parallel worker emitted only 3718 rows, which is why the subsequent Sort was cheap (an in-memory quicksort in 619kB). The rewrite also enabled parallelism: plain SUM and COUNT support partial aggregation, whereas count(DISTINCT) does not, which is why the original plan could not be parallelized.
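The partial/finalize split the plan uses can be modeled as follows. This is a simplified sketch of the two-phase parallel aggregation idea, not PostgreSQL's actual code: each worker hash-aggregates its share of rows into partial sums keyed by (product_id, employee_id), and the leader merges the partial states per key.

```python
from collections import defaultdict

def partial_hash_aggregate(rows):
    """One worker: partial SUM(quantity), SUM(eur_value) per (product_id, employee_id)."""
    acc = defaultdict(lambda: [0, 0.0])
    for product_id, employee_id, q, v in rows:
        state = acc[(product_id, employee_id)]
        state[0] += q
        state[1] += v
    return acc

def finalize(partials):
    """Leader: merge per-worker partial states (the Finalize GroupAggregate step)."""
    merged = defaultdict(lambda: [0, 0.0])
    for partial in partials:
        for key, (q, v) in partial.items():
            merged[key][0] += q
            merged[key][1] += v
    return merged

# Two "workers" each see half of the table.
w0 = partial_hash_aggregate([(1, 10, 2, 5.0), (1, 11, 3, 1.0)])
w1 = partial_hash_aggregate([(1, 10, 1, 2.5), (2, 10, 4, 8.0)])
print(dict(finalize([w0, w1])))
# {(1, 10): [3, 7.5], (1, 11): [3, 1.0], (2, 10): [4, 8.0]}
```

SUM and COUNT work here because their partial states can be merged by simple addition; a DISTINCT aggregate has no such mergeable partial state, which is why the original query could not be split this way.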
By rewriting the query we cut the elapsed time by about 73% (from 2021 ms to 540 ms) and eliminated the spill to disk.
You might think the revised SQL is a little messier and harder to read, but you can't gain anything without sacrifice.