Last year I was involved in a project where we had to change the DBMS from Oracle to PostgreSQL. Despite the existence of SQL standards, some Oracle SQL code required at least some changes before it could be ported to PostgreSQL.
After being ported, some SQL statements suffered a degradation in performance. The following is one of those examples, showing how we changed the Oracle code and maintained its performance in PostgreSQL.
Below is the SQL statement a developer converted from Oracle code.
SELECT EMPLOYEE_ID, SALE_ID, SALE_DATE, EUR_VALUE, PRODUCT_ID
FROM SALES A
WHERE (EMPLOYEE_ID, SALE_DATE)
IN (SELECT EMPLOYEE_ID, MAX(SALE_DATE)
FROM SALES
GROUP BY EMPLOYEE_ID)
ORDER BY EMPLOYEE_ID;
The developer complained that the elapsed time of the above query increased in PostgreSQL.
My first thoughts about the SQL statement above were:
1. I may be able to improve the performance by removing the "IN" subquery.
2. Is there a way to fetch the resulting rows without visiting the SALES table twice?
In this note, I will show you how you can improve the query performance.
Below is the script I created to do some experiments.
CREATE TABLE employees (
  employee_id NUMERIC NOT NULL,
  first_name VARCHAR(1000) NOT NULL,
  last_name VARCHAR(1000) NOT NULL,
  date_of_birth DATE,
  phone_number VARCHAR(1000) NOT NULL,
  junk CHAR(1000),
  CONSTRAINT employees_pk
    PRIMARY KEY (employee_id)
);
CREATE FUNCTION random_string(minlen NUMERIC, maxlen NUMERIC)
RETURNS VARCHAR(1000) AS $$
DECLARE
  rv VARCHAR(1000) := '';
  i INTEGER := 0;
  len INTEGER := 0;
BEGIN
  IF maxlen < 1 OR minlen < 1 OR maxlen < minlen THEN
    RETURN rv;
  END IF;
  len := floor(random() * (maxlen - minlen)) + minlen;
  FOR i IN 1..floor(len) LOOP
    rv := rv || chr(97 + CAST(random() * 25 AS INTEGER));
  END LOOP;
  RETURN rv;
END;
$$ LANGUAGE plpgsql;
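As a quick check of the helper (the output varies from call to call, so the value shown in the comment is only illustrative):
SELECT random_string(2, 8);  -- e.g. 'qfkx': a short random lowercase string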
INSERT INTO employees (employee_id, first_name, last_name, date_of_birth, phone_number, junk)
SELECT GENERATE_SERIES
     , initcap(lower(random_string(2, 8)))
     , initcap(lower(random_string(2, 8)))
     , CURRENT_DATE - CAST(floor(random() * 365 * 10 + 40 * 365) AS NUMERIC) * INTERVAL '1 DAY'
     , CAST(floor(random() * 9000 + 1000) AS NUMERIC)
     , 'junk'
FROM GENERATE_SERIES(1, 1000);
DROP TABLE IF EXISTS sales;
CREATE TABLE sales (
sale_id NUMERIC NOT NULL,
employee_id NUMERIC NOT NULL,
subsidiary_id NUMERIC NOT NULL,
sale_date timestamp NOT NULL,
eur_value NUMERIC(17,2) NOT NULL,
product_id BIGINT NOT NULL,
quantity INTEGER NOT NULL,
CHANNEL VARCHAR(4) NOT NULL,
CONSTRAINT sales_pk
PRIMARY KEY (sale_id)
);
SELECT SETSEED(0);
INSERT INTO sales (sale_id
, subsidiary_id, employee_id
, sale_date, eur_value
, product_id, quantity
, CHANNEL)
SELECT row_number() OVER (), data.*
FROM (
SELECT gen % 100, e.employee_id
, (CURRENT_DATE - CAST(RANDOM()*3650 AS NUMERIC) * INTERVAL '1 DAY') sale_date
, CAST(RANDOM()*100000 AS NUMERIC)/100 eur_value
, CAST(RANDOM()*25 AS NUMERIC) + 1 product_id
, CAST(RANDOM()*5 AS NUMERIC) + 1 quantity
, CASE WHEN GEN % 2 = 0 THEN 'ONLI' ELSE 'OFFL' END
FROM employees e
, GENERATE_SERIES(1, 18000) gen
WHERE MOD(employee_id, 7) = 4
ORDER BY sale_date
) data
WHERE TO_CHAR(sale_date, 'D') <> '1';
ANALYZE VERBOSE sales;
CREATE INDEX SALES_X02 ON SALES (EMPLOYEE_ID);
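Before looking at the queries, a quick sanity check of the generated data; the counts in the comment are from my run and match the row counts that appear in the execution plans later in this note:
SELECT COUNT(*) AS total_rows
     , COUNT(DISTINCT employee_id) AS employees
FROM sales;
-- total_rows = 2207094 and employees = 143 on my run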
From the converted query above we can infer that the business requirement is:
Extract the most recent data for each employee_id.
To double-check the business requirement, I asked for the original Oracle query.
The following is the SQL code in Oracle.
SELECT EMPLOYEE_ID
, MAX(SALE_ID) KEEP (DENSE_RANK LAST ORDER BY SALE_DATE) AS MAX_SALE_ID
, MAX(SALE_DATE) KEEP (DENSE_RANK LAST ORDER BY SALE_DATE) AS SALE_DATE
, MAX(EUR_VALUE) KEEP (DENSE_RANK LAST ORDER BY SALE_DATE) AS MAX_VALUE
, MAX(PRODUCT_ID) KEEP (DENSE_RANK LAST ORDER BY SALE_DATE) AS PRODUCT_ID
FROM SALES
GROUP BY EMPLOYEE_ID
ORDER BY EMPLOYEE_ID;
The KEEP (DENSE_RANK LAST) clause is not standard SQL; it is an Oracle dialect. Among application developers on Oracle it is not a commonly used way to get the most recent data for a specific column (EMPLOYEE_ID), but for advanced Oracle query tuners, KEEP (DENSE_RANK LAST) was a lifesaver. In the query above, Oracle first orders the rows by SALE_DATE (within each EMPLOYEE_ID) applying the dense_rank mechanism, that is to say, it keeps only the rows that have the highest (last) dense rank, and then takes the maximum of that subset of the data.
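If the clause is unfamiliar, the following is a minimal sketch (my own rewrite in portable SQL, not Oracle syntax) of what MAX(SALE_ID) KEEP (DENSE_RANK LAST ORDER BY SALE_DATE) computes for a single column:
-- Restrict each employee's rows to those with the latest SALE_DATE
-- (the "LAST" dense-rank subset), then take MAX over that subset.
SELECT employee_id
     , MAX(sale_id) AS max_sale_id
FROM sales s
WHERE sale_date = (SELECT MAX(sale_date)
                   FROM sales
                   WHERE employee_id = s.employee_id)
GROUP BY employee_id;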
PostgreSQL doesn't have a counterpart function, so we have to use standard SQL to implement the business requirement:
Extract the most recent data for each employee_id.
The first question to ask in response to this request is:
What if an employee_id has two or more rows at exactly the same time?
Should we fetch just one row for that time, or should we fetch all the rows for that time? This is the kind of special case you have to think about the moment you start to turn the natural-language request into a formal-language specification.
From the Oracle SQL code we can infer that we should report all of them if several sales for an employee_id happen simultaneously.
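As a quick illustration of that decision (a toy example with made-up rows, not the test data), the (EMPLOYEE_ID, SALE_DATE) IN (...) pattern that appears in the converted query returns both tied rows:
-- Two sales for employee 1 at exactly the same instant (a deliberate tie).
WITH s(employee_id, sale_id, sale_date) AS (
    VALUES (1, 10, TIMESTAMP '2020-05-01 09:00:00')
         , (1, 11, TIMESTAMP '2020-05-01 09:00:00')  -- tie on sale_date
         , (1,  9, TIMESTAMP '2020-04-30 17:30:00')
)
SELECT *
FROM s
WHERE (employee_id, sale_date) IN (SELECT employee_id, MAX(sale_date)
                                   FROM s
                                   GROUP BY employee_id);
-- returns both sale_id 10 and sale_id 11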
Implementation number 1:
Let's start with a list showing the most recent SALE_DATE for each EMPLOYEE_ID, and then report all the sales we can identify using that list of (EMPLOYEE_ID, SALE_DATE) pairs. To do that I will start with a simple aggregate query and use its result in an "IN" subquery:
SELECT EMPLOYEE_ID, SALE_ID, SALE_DATE, EUR_VALUE, PRODUCT_ID
FROM SALES A
WHERE (EMPLOYEE_ID, SALE_DATE)
IN (SELECT EMPLOYEE_ID, MAX(SALE_DATE)
FROM SALES
GROUP BY EMPLOYEE_ID)
ORDER BY EMPLOYEE_ID;
The query above is what the PostgreSQL coder created. Below is the resulting execution plan.
Gather Merge (actual time=776.957..779.738 rows=143 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=74388 read=25872
-> Sort (actual time=773.435..773.445 rows=48 loops=3)
Sort Key: a.employee_id
Sort Method: quicksort Memory: 36kB
Buffers: shared hit=74388 read=25872
Worker 0: Sort Method: quicksort Memory: 28kB
Worker 1: Sort Method: quicksort Memory: 30kB
-> Hash Join (actual time=773.351..773.408 rows=48 loops=3)
Hash Cond: ((a.employee_id = sales.employee_id) AND (a.sale_date = (
Buffers: shared hit=74374 read=25872
-> Parallel Seq Scan on sales a (actual time=0.029..44.464 rows=735698 loops=3)
Buffers: shared hit=16117 read=8936
-> Hash (actual time=634.248..634.254 rows=143 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 15kB
Buffers: shared hit=58223 read=16936
-> HashAggregate (actual time=634.162..634.183 rows=143 loops=3)
Group Key: sales.employee_id
Batches: 1 Memory Usage: 48kB
Buffers: shared hit=58223 read=16936
Worker 0: Batches: 1 Memory Usage: 48kB
Worker 1: Batches: 1 Memory Usage: 48kB
-> Seq Scan on sales (actual time=0.006..137.369 rows=2207094 loops=3)
Buffers: shared hit=58223 read=16936
Planning Time: 0.151 ms
Execution Time: 779.951 ms
The first thing you will notice about the execution plan is that the "IN" subquery collapsed and the optimizer used a simple hash join to get the final result. This is something that often happens with an "IN" subquery, and it brings us to Implementation 2.
Implementation number 2:
I take it as gospel that the optimizer cannot handle subqueries efficiently, so I prefer inline views to "IN" subqueries. Surely there are cases where subqueries are more efficient than inline views, but those cases are rare. So I rewrote the query using an inline view.
SELECT *
FROM SALES A
, (SELECT EMPLOYEE_ID, MAX(SALE_DATE) AS MAX_DATE
FROM SALES
GROUP BY EMPLOYEE_ID) B
WHERE A.EMPLOYEE_ID = B.EMPLOYEE_ID
AND A.SALE_DATE = B.MAX_DATE
ORDER BY A.EMPLOYEE_ID;
Gather Merge (actual time=768.409..771.284 rows=143 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=70662 read=29598
-> Sort (actual time=763.881..763.886 rows=48 loops=3)
Sort Key: a.employee_id
Sort Method: quicksort Memory: 28kB
Buffers: shared hit=70662 read=29598
Worker 0: Sort Method: quicksort Memory: 30kB
Worker 1: Sort Method: quicksort Memory: 36kB
-> Hash Join (actual time=763.799..763.845 rows=48 loops=3)
Hash Cond: ((a.employee_id = sales.employee_id) AND (a.sale_date =
Buffers: shared hit=70648 read=29598
-> Parallel Seq Scan on sales a (actual time=0.039..46.634 rows=735698 loops=3)
Buffers: shared hit=16154 read=8899
-> Hash (actual time=617.055..617.055 rows=143 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 15kB
Buffers: shared hit=54460 read=20699
-> HashAggregate (actual time=617.000..617.017 rows=143 loops=3)
Group Key: sales.employee_id
Batches: 1 Memory Usage: 48kB
Buffers: shared hit=54460 read=20699
Worker 0: Batches: 1 Memory Usage: 48kB
Worker 1: Batches: 1 Memory Usage: 48kB
-> Seq Scan on sales (actual time=0.008..139.083 rows=2207094 loops=3)
Buffers: shared hit=54460 read=20699
Planning Time: 0.249 ms
Execution Time: 771.377 ms
You will notice, of course, the remarkable similarity between the previous execution plan and this one: the plans are in fact the same. The PostgreSQL optimizer was capable of transforming the previous query with the "IN" subquery into this query with the inline view, so the performance of the two queries is the same.
Implementation number 3:
We might instead decide to check every tuple in the table to see whether the sale date in that row is the most recent sale date for the employee_id in that row. We can do that by running a correlated subquery that performs the check for every tuple in the table.
SELECT *
FROM SALES A
WHERE SALE_DATE = (SELECT MAX(SALE_DATE)
FROM SALES B
WHERE A.EMPLOYEE_ID = B.EMPLOYEE_ID)
ORDER BY A.EMPLOYEE_ID;
Index Scan using sales_x02 on portal.sales a
Output: a.sale_id, a.employee_id, a.subsidiary_id, a.sale_date, a.eur_value, a.pro ....
Filter: (a.sale_date = (SubPlan 1))
SubPlan 1
-> Aggregate
Output: max(b.sale_date)
-> Bitmap Heap Scan on portal.sales b
Output: b.sale_id, b.employee_id, b.subsidiary_id, b.sale_date, .....
Recheck Cond: (a.employee_id = b.employee_id)
-> Bitmap Index Scan on sales_x02
Index Cond: (b.employee_id = a.employee_id)
In the execution plan above I intentionally omitted the execution time and block I/Os. When I ran the query, I still had no results after waiting 120 seconds, so I canceled it. Since there are 2,207,094 rows in the SALES table, we can infer that the subquery (SubPlan 1 in the plan above) would be executed 2,207,094 times.
This test tells us that you should avoid writing a correlated subquery that has to be re-executed for every row of a large table.
If you are not clear about what a correlated subquery is, please refer to this site:
https://en.wikipedia.org/wiki/Correlated_subquery
Implementation number 4:
A conspicuous defect in the three plans so far is that we have to visit the SALES table twice. One of the principles I take as gospel is:
Don't access the same table twice. Try to find a way to visit the table only once in order to fetch rows.
Is there a way we can avoid accessing the SALES table twice? The answer is yes. PostgreSQL offers us various analytic (window) functions.
SELECT EMPLOYEE_ID, SALE_ID, SALE_DATE, EUR_VALUE, PRODUCT_ID
FROM (
SELECT EMPLOYEE_ID
, SALE_ID
, SALE_DATE
, EUR_VALUE
, PRODUCT_ID
, MAX(SALE_DATE) OVER (PARTITION BY EMPLOYEE_ID) MAX_DATE
FROM SALES
) A
WHERE A.SALE_DATE = A.MAX_DATE
ORDER BY A.EMPLOYEE_ID;
Subquery Scan on a (actual time=26.266..3384.268 rows=143 loops=1)
Filter: (a.sale_date = a.max_date)
Rows Removed by Filter: 2206951
Buffers: shared hit=760536 read=897450
-> WindowAgg (actual time=24.097..3304.210 rows=2207094 loops=1)
Buffers: shared hit=760536 read=897450
-> Index Scan using sales_x02 on sales (actual time=0.071..2444.642 rows=2207094 loops=1)
Buffers: shared hit=760536 read=897450
Planning Time: 0.075 ms
Execution Time: 3384.413 ms
By adding the analytic max() function I can acquire the necessary data once and post-process it to find the max(sale_date) for each employee_id before discarding all the rows where the row's sale date doesn't match the maximum sale date. The expression "max(sale_date) over (partition by employee_id)" tells PostgreSQL to partition the data by employee_id and find the maximum sale date within the employee_id.
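If the window expression is unfamiliar, a toy example with made-up values shows what it adds to each row:
SELECT employee_id
     , sale_date
     , MAX(sale_date) OVER (PARTITION BY employee_id) AS max_date
FROM (VALUES (1, DATE '2020-01-01')
           , (1, DATE '2020-01-05')
           , (2, DATE '2020-03-02')) AS t(employee_id, sale_date);
-- every row of employee 1 gets max_date = 2020-01-05;
-- the single row of employee 2 gets its own date.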
So we have managed to produce the same result with a single Index Scan instead of visiting the SALES table twice as in every other plan. But there is a serious defect here: to be able to partition by employee_id, PostgreSQL has had to fetch every row and column we are interested in and aggregate the data before deriving values for the new column. Consequently, the Index Scan the planner chose was a bad idea; we can see that the Index Scan alone took 2,444 ms.
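One way to test that suspicion (a session-level experiment; the settings below are standard PostgreSQL planner switches) is to turn off index scans and compare the resulting Seq Scan + Sort plan against the one above:
SET enable_indexscan = off;   -- discourage plain index scans for this session
SET enable_bitmapscan = off;  -- and bitmap scans, so a Seq Scan is chosen
EXPLAIN (ANALYZE, BUFFERS)
SELECT employee_id, sale_id, sale_date, eur_value, product_id
FROM (SELECT employee_id, sale_id, sale_date, eur_value, product_id
           , MAX(SALE_DATE) OVER (PARTITION BY EMPLOYEE_ID) AS max_date
      FROM sales) a
WHERE a.sale_date = a.max_date
ORDER BY a.employee_id;
RESET enable_indexscan;
RESET enable_bitmapscan;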
In this case the variation in actual execution time between the plans was dramatic. Compared with the elapsed times of Implementation 1 and Implementation 2, it appears that the analytic function doesn't give us any benefit.
Implementation number 5:
The execution time above puzzled me, because I have always insisted and believed that reducing accesses to the table is the first thing to do when speeding up a query. So I tried another analytic function.
SELECT EMPLOYEE_ID, SALE_ID, SALE_DATE, EUR_VALUE, PRODUCT_ID
FROM (SELECT EMPLOYEE_ID
, SALE_ID
, SALE_DATE
, EUR_VALUE
, PRODUCT_ID
, ROW_NUMBER() OVER (PARTITION BY EMPLOYEE_ID ORDER BY SALE_DATE DESC) RN
FROM SALES
) A
WHERE RN = 1