Recently I was asked to optimize a query that involved joining a dimension table and a fact table, and using an aggregate function. Here is the shape of the query I was tasked with tuning.
SELECT TO_CHAR(E.DATE_OF_BIRTH,'YYYY'), SUM(S.QUANTITY)
FROM EMPLOYEES E, SALES S
WHERE E.EMPLOYEE_ID = S.EMPLOYEE_ID
AND S.EMPLOYEE_ID BETWEEN 1 AND 1000
GROUP BY TO_CHAR(DATE_OF_BIRTH,'YYYY');
If you want to do experiments, here is the code to generate some test data.
CREATE TABLE employees (
employee_id NUMERIC NOT NULL
, first_name VARCHAR(1000) NOT NULL
, last_name VARCHAR(1000) NOT NULL
, date_of_birth DATE
, phone_number VARCHAR(1000) NOT NULL
, junk CHAR(1000)
, CONSTRAINT employees_pk PRIMARY KEY (employee_id) );
CREATE FUNCTION random_string(minlen NUMERIC, maxlen NUMERIC)
RETURNS VARCHAR(1000)
AS $$
DECLARE rv VARCHAR(1000) := '';
i INTEGER := 0;
len INTEGER := 0;
BEGIN
IF maxlen < 1OR minlen < 1OR maxlen < minlen
THEN RETURN rv;
END IF;
len := floor(random()*(maxlen-minlen)) + minlen;
FOR i IN 1..floor(len) LOOP
rv := rv || chr(97+CAST(random() * 25 AS INTEGER));
END LOOP;
RETURN rv;
END;
$$ LANGUAGE plpgsql;
INSERT
INTO employees (employee_id, first_name, last_name, date_of_birth, phone_number, junk)
SELECT GENERATE_SERIES ,
initcap(lower(random_string(2, 8))) ,
initcap(lower(random_string(2, 8))) ,
CURRENT_DATE - CAST(floor(random() * 365 * 10 + 40 * 365) AS NUMERIC) * INTERVAL '1 DAY' ,
CAST(floor(random() * 9000 + 1000) AS NUMERIC) ,
'junk'
FROM GENERATE_SERIES(1, 1000);
CREATE TABLE sales (
sale_id NUMERIC NOT NULL,
employee_id NUMERIC NOT NULL,
subsidiary_id NUMERIC NOT NULL,
sale_date DATE NOT NULL,
eur_value NUMERIC(17,2) NOT NULL,
product_id BIGINT NOT NULL,
quantity INTEGER NOT NULL,
CHANNEL VARCHAR(4) NOT NULL,
CONSTRAINT sales_pk
PRIMARY KEY (sale_id, SALE_DATE)
) PARTITION BY RANGE (SALE_DATE);
CREATE TABLE SALES_MIN PARTITION OF SALES
FOR VALUES FROM (MINVALUE) TO ('2021-01-01');
CREATE TABLE SALES_2021 PARTITION OF SALES
FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');
CREATE TABLE SALES_2022 PARTITION OF SALES
FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
CREATE TABLE SALES_MAX PARTITION OF SALES
FOR VALUES FROM ('2023-01-01') TO (MAXVALUE);
SELECT SETSEED(0);
INSERT INTO sales (sale_id
, subsidiary_id, employee_id
, sale_date, eur_value
, product_id, quantity
, CHANNEL)
SELECT row_number() OVER (), data.*
FROM (
SELECT gen % 100, e.employee_id
, (CURRENT_DATE - CAST(RANDOM()*3650 AS NUMERIC) * INTERVAL '1 DAY') sale_date
, CAST(RANDOM()*100000 AS NUMERIC)/100 eur_value
, CAST(RANDOM()*25 AS NUMERIC) + 1 product_id
, CAST(RANDOM()*5 AS NUMERIC) + 1 quantity
, CASE WHEN GEN % 2 = 0 THEN 'ONLI' ELSE 'OFFL' END
FROM employees e
, GENERATE_SERIES(1, 18000) gen
WHERE MOD(employee_id, 7) = 4
-- AND gen < employee_id / 2
ORDER BY sale_date
) data
WHERE TO_CHAR(sale_date, 'D') <> '1';
DROP INDEX SALES_X01;
CREATE INDEX SALES_X01 ON SALES(SALE_DATE);
One of the golden rules in query tuning is to minimize the number of rows involved in the join. With the golden rule in mind, I was able to improve the query performance with ease by reducing the number of rows participating in the join. I achieved this by implementing the GROUP BY operation on the SALES table first.
Here is the query that I rewrote:
SELECT TO_CHAR(DATE_OF_BIRTH,'YYYY'), SUM(S.QUANTITY)
FROM EMPLOYEES E
, (SELECT EMPLOYEE_ID, SUM(S.QUANTITY) AS QUANTITY
FROM SALES S
WHERE S.EMPLOYEE_ID BETWEEN 1 AND 1000
GROUP BY EMPLOYEE_ID
) S
WHERE E.EMPLOYEE_ID = S.EMPLOYEE_ID
GROUP BY TO_CHAR(DATE_OF_BIRTH,'YYYY');
I have disabled parallel processing for the purspose of simplicity and clarity in my explanation during the test.
SET MAX_PARALLEL_WORKERS_PER_GATHER = 0;
I have added the execution plan of the original query followed by the execution plan of the rewritten query:
HashAggregate (actual time=1798.895..1798.901 rows=11 loops=1)
Group Key: to_char((e.date_of_birth)::timestamp with time zone, 'YYYY'::text)
Batches: 1 Memory Usage: 49kB
Buffers: shared hit=1640 read=21251
-> Hash Join (actual time=0.455..1537.728 rows=2206370 loops=1)
Hash Cond: (s.employee_id = e.employee_id)
Buffers: shared hit=1640 read=21251
-> Append (actual time=0.077..610.388 rows=2206370 loops=1)
Buffers: shared hit=1497 read=21251
-> Seq Scan on sales_min s_1 (actual time=0.076..345.943 rows=1731877 loops=1)
Filter: ((employee_id >= '1'::numeric) AND (employee_id <= '1000'::numeric))
Buffers: shared hit=96 read=17759
-> Seq Scan on sales_2021 s_2 (actual time=0.041..49.016 rows=220555 loops=1)
Filter: ((employee_id >= '1'::numeric) AND (employee_id <= '1000'::numeric))
Buffers: shared hit=1059 read=1215
-> Seq Scan on sales_2022 s_3 (actual time=0.026..46.350 rows=220021 loops=1)
Filter: ((employee_id >= '1'::numeric) AND (employee_id <= '1000'::numeric))
Buffers: shared hit=342 read=1927
-> Seq Scan on sales_max s_4 (actual time=0.041..24.626 rows=33917 loops=1)
Filter: ((employee_id >= '1'::numeric) AND (employee_id <= '1000'::numeric))
Buffers: shared read=350
-> Hash (actual time=0.354..0.354 rows=1000 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 49kB
Buffers: shared hit=143
-> Seq Scan on employees e (actual time=0.005..0.253 rows=1000 loops=1)
Buffers: shared hit=143
Planning Time: 0.193 ms
Execution Time: 1798.941 ms
Here is the plan of the rewritten query.
GroupAggregate (actual time=890.922..890.943 rows=11 loops=1)
Group Key: (to_char((e.date_of_birth)::timestamp with time zone, 'YYYY'::text))
Buffers: shared hit=5164 read=17727
-> Sort (actual time=890.914..890.923 rows=143 loops=1)
Sort Key: (to_char((e.date_of_birth)::timestamp with time zone, 'YYYY'::text))
Sort Method: quicksort Memory: 32kB
Buffers: shared hit=5164 read=17727
-> Hash Join (actual time=890.596..890.887 rows=143 loops=1)
Hash Cond: (e.employee_id = s.employee_id)
Buffers: shared hit=5164 read=17727
-> Seq Scan on employees e (actual time=0.009..0.158 rows=1000 loops=1)
Buffers: shared hit=143
-> Hash (actual time=890.570..890.572 rows=143 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 15kB
Buffers: shared hit=5021 read=17727
-> Subquery Scan on s (actual time=890.523..890.548 rows=143 loops=1)
Buffers: shared hit=5021 read=17727
-> HashAggregate (actual time=890.522..890.537 rows=143 loops=1)
Group Key: s_1.employee_id
Batches: 1 Memory Usage: 48kB
Buffers: shared hit=5021 read=17727
-> Append (actual time=0.089..533.670 rows=2206370 loops=1)
Buffers: shared hit=5021 read=17727
-> Seq Scan on sales_min s_2 (actual time=0.089..315.173 rows=1731877 loops=1)
Filter: ((employee_id >= '1'::numeric) AND (employee_id <= '1000'::numeric))
Buffers: shared hit=128 read=17727
-> Seq Scan on sales_2021 s_3 (actual time=0.010..36.716 rows=220555 loops=1)
Filter: ((employee_id >= '1'::numeric) AND (employee_id <= '1000'::numeric))
Buffers: shared hit=2274
-> Seq Scan on sales_2022 s_4 (actual time=0.015..38.771 rows=220021 loops=1)
Filter: ((employee_id >= '1'::numeric) AND (employee_id <= '1000'::numeric))
Buffers: shared hit=2269
-> Seq Scan on sales_max s_5 (actual time=0.013..5.878 rows=33917 loops=1)
Filter: ((employee_id >= '1'::numeric) AND (employee_id <= '1000'::numeric))
Buffers: shared hit=350
Planning Time: 0.204 ms
Execution Time: 890.987 ms
The following is a quick description of the second plan:
1) The GroupAggregate node at the top part of the plan calls the Sort node.
2) The Sort node then calls the Hash Join node.
3) The Hash Join node calls the Seq Scan on employees node.
4) The optimizer checks if any rows exist in the EMPLOYEES table.
If it doesn't find a single row, the optimizer does nothing. Since rows are present, it proceeds with the following steps.
5) The Hash Join node in the step 3) calls the Hash node, which is at the same level as the Seq Scan on employees node.
6) The Hash node calls the Subquery Scan node.
7) The Subquery Scan node calls the HashAggregate node.
8) The HashAggregate node calls the Append node.
9) The Append node calls four Seq Scan nodes.
10) Four Seq Scan nodes scan four partitioned tables respectively, all filtered by EMPLOYEE_ID. Then, the four Seq Scan nodes sends the retrieved rows to the Append node.
11) The Append node passes the received rows up to the HashAggregate node.
12) The HashAggregate node aggregates the rows using a hash aggregate operation by the EMPLOYEE_ID column.
13) The Subquery Scan node receives the aggregated rows and passes them up to the Hash node.
14) The Hash node builds a hash table from the intermediate results of the Subquery Scan.
15) The Hah Join node probes the hash table with the rows received from the Seq Scan on employees.
16) The sort node sorts the intermediate data set by TO_CHAR(E.DATE_OF_BIRTH).
17) The GroupAggregate node groups the data by TO_CHAR(E.DATE_OF_BIRTH).
Note that although the number of block IOs is the same, the elapsed time has been reduced by half. This is because we decreased the number of rows participating in the join from 2206370 to 143 from the SALES table. On top of that, the number of fuction calls to TO_CHAR(DATE_OF_BIRTH,'YYYY') decreased from 2206370 to 143. This is a perfect example of killing two birds with one stone.
In the second execution plan I have highlighted to_char((e.date_of_birth)::timestamp with time zone, 'YYYY'::text). Dates in any programming languages are tricky to work with, and PostgreSQL is no exception. Date manipulation is an essential skill for any developers, whether it is extracting the year from a database, date of birth column, or dealing with timestamp in a server log.
To get a year value from the DATE_OF_BIRTH column, PostgreSQL used two internal functions , ::TIMESTAMP and TO_CHAR, which is not efficient. We can extract a year from one function, EXTRACT. Using the EXTRACT function I have rewritten the query as follows:
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
SELECT EXTRACT(YEAR FROM date_of_birth), SUM(S.QUANTITY)
FROM EMPLOYEES E
, (SELECT EMPLOYEE_ID, SUM(S.QUANTITY) AS QUANTITY
FROM SALES S
WHERE S.EMPLOYEE_ID BETWEEN 1 AND 1000
GROUP BY EMPLOYEE_ID
) S
WHERE E.EMPLOYEE_ID = S.EMPLOYEE_ID
GROUP BY EXTRACT(YEAR FROM date_of_birth).
Here is the execution plan obtained by running the EXPLAN command.
GroupAggregate (actual time=893.377..893.400 rows=11 loops=1)
Group Key: (EXTRACT(year FROM e.date_of_birth))
Buffers: shared hit=5388 read=17503
-> Sort (actual time=893.370..893.379 rows=143 loops=1)
Sort Key: (EXTRACT(year FROM e.date_of_birth))
Sort Method: quicksort Memory: 32kB
Buffers: shared hit=5388 read=17503
-> Hash Join (actual time=893.067..893.338 rows=143 loops=1)
Hash Cond: (e.employee_id = s.employee_id)
Buffers: shared hit=5388 read=17503
-> Seq Scan on employees e (actual time=0.010..0.159 rows=1000 loops=1)
Buffers: shared hit=143
-> Hash (actual time=893.045..893.047 rows=143 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 15kB
Buffers: shared hit=5245 read=17503
-> Subquery Scan on s (actual time=892.966..892.993 rows=143 loops=1)
Buffers: shared hit=5245 read=17503
-> HashAggregate (actual time=892.965..892.980 rows=143 loops=1)
Group Key: s_1.employee_id
Batches: 1 Memory Usage: 48kB
Buffers: shared hit=5245 read=17503
-> Append (actual time=0.052..535.095 rows=2206370 loops=1)
Buffers: shared hit=5245 read=17503
-> Seq Scan on sales_min s_2 (actual time=0.052..309.003 rows=1731877 loops=1)
Filter: ((employee_id >= '1'::numeric) AND (employee_id <= '1000'::numeric))
Buffers: shared hit=352 read=17503
-> Seq Scan on sales_2021 s_3 (actual time=0.012..39.384 rows=220555 loops=1)
Filter: ((employee_id >= '1'::numeric) AND (employee_id <= '1000'::numeric))
Buffers: shared hit=2274
-> Seq Scan on sales_2022 s_4 (actual time=0.013..44.216 rows=220021 loops=1)
Filter: ((employee_id >= '1'::numeric) AND (employee_id <= '1000'::numeric))
Buffers: shared hit=2269
-> Seq Scan on sales_max s_5 (actual time=0.014..5.732 rows=33917 loops=1)
Filter: ((employee_id >= '1'::numeric) AND (employee_id <= '1000'::numeric))
Buffers: shared hit=350
Planning Time: 0.241 ms
Execution Time: 893.444 ms
In the above example, we can not notice any performance gains when we substituted EXTRACT for TO_CHAR and ::TIMESTAMP because they were executed only 143 times. However unnecessay function calls are insidious. As the number of function calls rises, the performance gap will become apparent.
If you replace TO_CHAR and ::TIMESTAMP with EXTRACT in the original query, you will be able to observe the performance gap. I did not record the test results here as I leave it as an exercise for the readers.
Conclusion
1. To improve query performance when joining a fact table and a dimension table in a Data Warehouse, it is recommended to reduce the number of rows participating in the join. Placing GROUP BY before the join can be an effctive approach.
2. Be cautious about extracting a year from a date column and try to minimize the number of function calls.