An old video uploaded by Connor McDonald in 2020 prompted me to write this article.
https://www.youtube.com/watch?v=Gm8N_yip0Ek&list=PLJMaoEWvHwFJRKQ8TWF17yn94N8lR7L0S&index=9
I believe the video is an excerpt from a presentation aimed at beginner-level application coders, so he simply introduced the SQL with clause as a way of adopting a procedural approach to writing queries that meet business requirements, and he did not address the downside of subquery factoring. For Postgres users unfamiliar with the term, the SQL with clause, or subquery factoring, is what PostgreSQL calls a CTE (common table expression).
Here is a sample query (with some cosmetic changes) Connor supplied in the video.
WITH dept_sal AS (
SELECT dept_name, SUM(sal) dept_sal --Firstly get the total salary paid by each department
FROM employee e, dept d
WHERE e.deptno = d.deptno
GROUP BY dept_name),
avg_sal AS ( --Secondly get the average of these totals.
SELECT AVG(dept_sal) avsal
FROM dept_sal
)
SELECT * --Lastly list those departments above that average.
FROM dept_sal d, avg_sal a
WHERE d.dept_sal > a.avsal
ORDER BY d.dept_name;
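As an aside, if you want to reproduce the execution plans shown later, the following is a hypothetical sketch of the two tables. The column types, and the empno column, are my guesses inferred from the queries and plans, not the original schema.
-- Hypothetical table definitions, inferred from the queries and plans below;
-- the column types and the empno column are guesses, not the original schema.
CREATE TABLE dept (
    deptno    integer PRIMARY KEY,
    dept_name varchar(100)
);

CREATE TABLE employee (
    empno  integer PRIMARY KEY,
    deptno numeric,        -- the plans show dept.deptno being cast to numeric
    sal    numeric(7,2)
);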
Connor said, "Relational model is a rigorous model. But relational model sucks(?) because it is so complicated. We don't think relationally; human beings think in procedural terms. Subquery factoring is an awesome metaphor for solving problems. Our challenge is to take the procedural thing and turn it into relational code, which is very hard. The SQL with clause lets us adopt a procedural approach to a relational solution. This matches my mindset. This is how I solve problems. And it is up to the database to work out how to best run the query."
And he presented us with the SQL statement above to show how subquery factoring can be used in a stepwise approach.
The following is the mindset the programmer had:
Firstly get the total salary paid by each department.
Secondly get the average of these totals.
Lastly list those departments above that average.
I disagree with Connor on two counts: first, the relational model is not something that sucks; you can implement business logic with less code using the relational model. Second, you should not expect the database to work out the best way to run the query. The database does not know the data the way you do. If the query is far from being relational, the database cannot find the best strategy for producing the output.
If you are a developer with less than one year of experience, you are allowed to write a query with a procedural mindset.
However, if you want to be an intermediate-level developer, you should be able to think relationally; that is to say, you should be able to come up with a set-based solution, because the side effects of a procedure-driven query can be nasty. If you think the relational model sucks, you had better find another job. Software development is not for you.
Getting back to the sample query, if we run the query we get the following execution plan.
Sort (cost=412.14..412.15 rows=4 width=96)
Sort Key: d.dept_name
CTE dept_sal --this is the intermediate result set
-> HashAggregate (cost=411.27..411.42 rows=12 width=133)
Group Key: d_1.dept_name
-> Hash Join (cost=1.27..361.27 rows=10000 width=105)
Hash Cond: (e.deptno = (d_1.deptno)::numeric)
-> Seq Scan on employee e (cost=0.00..210.00 rows=10000 width=9)
-> Hash (cost=1.12..1.12 rows=12 width=105)
-> Seq Scan on dept d_1 (cost=0.00..1.12 rows=12 width=105)
-> Nested Loop (cost=0.27..0.68 rows=4 width=96)
Join Filter: (d.dept_sal > (avg(dept_sal.dept_sal)))
-> Aggregate (cost=0.27..0.28 rows=1 width=32)
-> CTE Scan on dept_sal (cost=0.00..0.24 rows=12 width=32)
-> CTE Scan on dept_sal d (cost=0.00..0.24 rows=12 width=64)
The inefficiency of the above execution plan is as follows:
PostgreSQL joined the two tables (employee and dept) and materialized the intermediate result set dept_sal. It then scanned dept_sal once to compute the average and scanned it a second time to join against that average. In other words, it accesses the intermediate result set dept_sal repeatedly.
When the intermediate result set is small, this execution plan is acceptable. But when it is big, materializing it and scanning it repeatedly can incur a lot of disk I/O, and the plan is no longer acceptable.
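As a side note, on PostgreSQL 12 or later the materialization behaviour is controllable. The sketch below adds the NOT MATERIALIZED keyword to ask the planner to inline the CTEs instead of materializing them; since dept_sal is referenced twice, inlining may recompute the join and aggregation twice, so treat this as a knob to experiment with rather than a fix. The set-based rewrites that follow are usually the better answer.
-- A sketch, assuming PostgreSQL 12 or later: NOT MATERIALIZED lets the planner
-- inline a CTE instead of materializing it. Because dept_sal is referenced
-- twice, inlining may run the join and aggregation twice, so measure before
-- relying on it.
WITH dept_sal AS NOT MATERIALIZED (
    SELECT dept_name, SUM(sal) AS dept_sal
    FROM employee e, dept d
    WHERE e.deptno = d.deptno
    GROUP BY dept_name
),
avg_sal AS NOT MATERIALIZED (
    SELECT AVG(dept_sal) AS avsal
    FROM dept_sal
)
SELECT *
FROM dept_sal d, avg_sal a
WHERE d.dept_sal > a.avsal
ORDER BY d.dept_name;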
Anyway, if you are an intermediate-level developer, how would you rewrite the sample query?
Here is the query I rewrote without subquery factoring followed by its execution plan.
SELECT *
FROM (
SELECT d.dept_name
, SUM(sal) AS dept_sal
, AVG(SUM(sal)) OVER () AS avsal
FROM employee e, dept d
WHERE e.deptno = d.deptno
GROUP BY d.dept_name
) a
WHERE dept_sal > avsal
ORDER BY dept_name;
Sort (cost=411.76..411.77 rows=4 width=165)
Sort Key: a.dept_name
-> Subquery Scan on a (cost=411.27..411.72 rows=4 width=165)
Filter: (a.dept_sal > a.avsal)
-> WindowAgg (cost=411.27..411.57 rows=12 width=165)
-> HashAggregate (cost=411.27..411.42 rows=12 width=133)
Group Key: d.dept_name
-> Hash Join (cost=1.27..361.27 rows=10000 width=105)
Hash Cond: (e.deptno = (d.deptno)::numeric)
-> Seq Scan on employee e (cost=0.00..210.00 rows=10000 width=9)
-> Hash (cost=1.12..1.12 rows=12 width=105)
-> Seq Scan on dept d (cost=0.00..1.12 rows=12 width=105)
Note that the optimizer's estimation of the cost dropped a little (412.15 -> 411.77).
Early grouping may reduce the query processing cost by reducing the amount of data participating in joins. In the execution plan above we can see that the 12 rows from the table dept and the 10000 rows from the table employee participate in the join. On the other hand, if we rewrite the query as shown below, the 12 rows from the table dept are joined to the 6 rows of the intermediate result set derived from the table employee.
Here is the query I rewrote to decrease the amount of data participating in joins.
SELECT d.dept_name, e.dept_sal, e.avsal
FROM dept d,
(SELECT *
FROM (
SELECT deptno, SUM(sal) dept_sal, AVG(SUM(sal)) OVER () as avsal
FROM public.employee
GROUP BY deptno
) k
WHERE dept_sal > avsal
) e
WHERE d.deptno = e.deptno
ORDER BY d.dept_name;
Sort (cost=261.71..261.72 rows=4 width=165)
Sort Key: d.dept_name
-> Hash Join (cost=260.50..261.67 rows=4 width=165)
Hash Cond: ((d.deptno)::numeric = k.deptno)
-> Seq Scan on dept d (cost=0.00..1.12 rows=12 width=105)
-> Hash (cost=260.45..260.45 rows=4 width=69)
-> Subquery Scan on k (cost=260.00..260.45 rows=4 width=69)
Filter: (k.dept_sal > k.avsal)
-> WindowAgg (cost=260.00..260.30 rows=12 width=69)
-> HashAggregate (cost=260.00..260.15 rows=12 width=37)
Group Key: employee.deptno
-> Seq Scan on employee (cost=0.00..210.00 rows=10000 width=9)
When we observe the plan output, we can see that the estimated cost was reduced from 411.77 to 261.72.
Obviously, this does not mean much on a 10,000-row table. The difference becomes interesting with huge tables.
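If you want to see the gap at a larger scale, one quick way is to load more synthetic rows and compare the plans again. The snippet below is a hypothetical data-generation sketch based on the table layout I assumed earlier, not part of the original test setup.
-- Hypothetical data generation (not from the original article): load one
-- million extra employees spread over the 12 departments, then refresh the
-- planner statistics. The empno offset avoids clashing with existing rows.
INSERT INTO employee (empno, deptno, sal)
SELECT 100000 + g,
       (g % 12) + 1,
       round((random() * 9000 + 1000)::numeric, 2)
FROM generate_series(1, 1000000) AS g;

ANALYZE employee;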
The last query may be less readable than the version using the SQL with clause, but it is worth rewriting the query to help the database server work efficiently.
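If readability is the concern, one middle ground is to keep the with clause for naming the step while still writing it in the set-based, early-grouping style. This is only a sketch, assuming PostgreSQL 12 or later, where a CTE referenced exactly once is inlined by default.
-- A sketch of the early-grouping rewrite expressed with a WITH clause for
-- readability; on PostgreSQL 12+ a single-reference CTE like this is inlined
-- by default, so the planner can still produce the cheaper plan shown above.
WITH dept_totals AS (
    SELECT deptno,
           SUM(sal)              AS dept_sal,
           AVG(SUM(sal)) OVER () AS avsal
    FROM employee
    GROUP BY deptno
)
SELECT d.dept_name, t.dept_sal, t.avsal
FROM dept d
JOIN dept_totals t ON d.deptno = t.deptno
WHERE t.dept_sal > t.avsal
ORDER BY d.dept_name;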
Summary
Let me recap what I have discussed so far.
Although the end results of all the queries above are identical, we can see that PostgreSQL does better, in terms of performance, with a set-based approach than with a procedural one.
Footnote
I am not saying that subquery factoring always produces a suboptimal execution plan. Sometimes I use the SQL with clause to force the join order or to materialize intermediate results.
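For completeness, on PostgreSQL 12 or later that materialization can be requested explicitly with the MATERIALIZED keyword. The sketch below reuses the tables from the examples above; whether forcing the fence helps depends on the query.
-- A sketch, assuming PostgreSQL 12 or later: MATERIALIZED forces the CTE to be
-- computed once and acts as an optimization fence, which is the behaviour I am
-- referring to when I deliberately materialize an intermediate result.
WITH dept_sal AS MATERIALIZED (
    SELECT dept_name, SUM(sal) AS dept_sal
    FROM employee e, dept d
    WHERE e.deptno = d.deptno
    GROUP BY dept_name
)
SELECT *
FROM dept_sal
ORDER BY dept_sal DESC;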
Added on Jan 28, 2024
I have found another wonderful YouTube video uploaded by Connor McDonald.
The video demonstrates the absurdity of writing application codes that process data row by row:
"For faster database code .... just follow the real world" - YouTube
Here are the code snippets presented in the video that optimize row-by-row data processing.
SQL*Plus
set arraysize 100
Pro*C
$ proc prefetch=100 ...
ODP.NET
reader.FetchSize = cmd.RowSize * 100;
JDBC
conn.setDefaultRowPrefetch(100);
These settings achieve the optimization by prefetching multiple rows in a single database call, thus reducing network overhead and improving performance.