row-by-row solution vs. set-based solution 3

Author : 모델광    Date : 2022-04-23 (Sat) 08:38    Views : 1689

An old video uploaded by Connor McDonald in 2020 prompted me to write this article.

https://www.youtube.com/watch?v=Gm8N_yip0Ek&list=PLJMaoEWvHwFJRKQ8TWF17yn94N8lR7L0S&index=9

I believe the video is an excerpt from a presentation aimed at beginner-level application coders, so he simply introduced the SQL WITH clause as a way of adopting a procedural approach to writing queries that meet certain business requirements; he did not address the downside of subquery factoring. For Postgres users unfamiliar with the Oracle terminology: the SQL WITH clause, or subquery factoring, is called a CTE (common table expression) in PostgreSQL.

Here is a sample query (with some cosmetic changes) that Connor supplied in the video.

WITH dept_sal AS (
   SELECT dept_name, SUM(sal) dept_sal   --Firstly get the total salary paid by each department
     FROM employee e, dept d
    WHERE e.deptno = d.deptno
   GROUP BY dept_name),
 avg_sal AS (                                         --Secondly get the average of these totals.
    SELECT AVG(dept_sal) avsal
      FROM dept_sal
    )
SELECT *                                             --Lastly list those departments above that average.
  FROM dept_sal d, avg_sal a
 WHERE d.dept_sal > a.avsal
ORDER BY d.dept_name;
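
If you want to reproduce the plans below, a minimal schema along the following lines will do. This is my own sketch; the original post does not show the DDL, so the column types and the generated data are assumptions, chosen only to match the row counts and the ::numeric cast that appear in the plans.

-- Hypothetical schema and data; types and row counts are assumptions
-- (12 departments, 10,000 employees, employee.deptno declared numeric
--  so that the join casts dept.deptno to numeric as in the plans).
CREATE TABLE dept (
    deptno     int PRIMARY KEY,
    dept_name  varchar(40)
);

CREATE TABLE employee (
    empno   int PRIMARY KEY,
    deptno  numeric,
    sal     numeric(10,2)
);

INSERT INTO dept
SELECT g, 'DEPT_' || g
  FROM generate_series(1, 12) g;

INSERT INTO employee
SELECT g, (g % 12) + 1, (random() * 9000 + 1000)::numeric(10,2)
  FROM generate_series(1, 10000) g;

ANALYZE dept;
ANALYZE employee;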

Connor said, "The relational model is a rigorous model. But the relational model sucks(?) because it is so complicated. We don't think relationally; human beings think in procedural terms. Subquery factoring is an awesome metaphor for solving problems. Our challenge is to take the procedural thing and turn it into relational code, which is very hard. The SQL WITH clause lets us adopt a procedural approach to a relational solution. This matches my mindset. This is how I solve problems. And it is up to the database to work out how to best run the query."

And he presented us with the SQL statement above to show how subquery factoring can be used in a stepwise approach.

The following is the mindset the programmer had.

Firstly get the total salary paid by each department

Secondly get the average of these totals.

Lastly list those departments above that average.

I disagree with Connor on two counts. First, the relational model is not something that sucks; you can implement business logic with less code using the relational model. Second, you should not expect the database to work out the best way to run the query. The database does not know the data the way you do. If the query is far from relational, the database cannot find the best strategy for producing the output.

If you are a developer with less than one year of experience, you are allowed to write a query with a procedural mindset. However, if you want to be an intermediate-level developer, you should be able to think relationally; that is to say, you should be able to come up with a set-based solution, because the side effects of a procedure-driven query can be nasty. If you think the relational model sucks, you had better find another job. Software development is not for you.

Getting back to the sample query, if we run the query we get the following execution plan.

Sort  (cost=412.14..412.15 rows=4 width=96)
  Sort Key: d.dept_name
  CTE dept_sal                                                   --this is the intermediate result set
    ->  HashAggregate  (cost=411.27..411.42 rows=12 width=133)
          Group Key: d_1.dept_name
          ->  Hash Join  (cost=1.27..361.27 rows=10000 width=105)
                Hash Cond: (e.deptno = (d_1.deptno)::numeric)
                ->  Seq Scan on employee e  (cost=0.00..210.00 rows=10000 width=9)
                ->  Hash  (cost=1.12..1.12 rows=12 width=105)
                      ->  Seq Scan on dept d_1  (cost=0.00..1.12 rows=12 width=105)
  ->  Nested Loop  (cost=0.27..0.68 rows=4 width=96)
        Join Filter: (d.dept_sal > (avg(dept_sal.dept_sal)))
        ->  Aggregate  (cost=0.27..0.28 rows=1 width=32)
              ->  CTE Scan on dept_sal  (cost=0.00..0.24 rows=12 width=32)
        ->  CTE Scan on dept_sal d  (cost=0.00..0.24 rows=12 width=64)

The inefficiency of the above execution plan is as follows:

PostgreSQL joined the two tables (employee and dept) to produce the intermediate result set dept_sal, aggregated dept_sal, and then joined dept_sal again. In other words, it accesses the intermediate result set dept_sal repeatedly. When the intermediate result set is small, this execution plan is acceptable; when it is big, the repeated access will incur a lot of disk I/O, and the plan is not acceptable.
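
As a side note, on PostgreSQL 12 or later the planner materializes a CTE by default when it is referenced more than once, as dept_sal is here. You can ask the planner to inline the CTE instead with the NOT MATERIALIZED keyword. This is only a sketch; whether inlining actually removes the extra scan, and whether it helps, depends on your data and query.

WITH dept_sal AS NOT MATERIALIZED (      -- PG 12+: request inlining instead of a materialized CTE
   SELECT dept_name, SUM(sal) AS dept_sal
     FROM employee e, dept d
    WHERE e.deptno = d.deptno
    GROUP BY dept_name),
 avg_sal AS NOT MATERIALIZED (
    SELECT AVG(dept_sal) AS avsal
      FROM dept_sal)
SELECT *
  FROM dept_sal d, avg_sal a
 WHERE d.dept_sal > a.avsal
 ORDER BY d.dept_name;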

Anyway, if you are an intermediate-level developer, how would you rewrite the sample query?

Here is the query I wrote without subquery factoring, followed by its execution plan.

SELECT *
  FROM (
        SELECT d.dept_name
             , SUM(sal)              AS dept_sal   -- total salary per department
             , AVG(SUM(sal)) OVER () AS avsal      -- average of those totals over all departments
          FROM employee e, dept d
         WHERE e.deptno = d.deptno
         GROUP BY d.dept_name
       ) a
 WHERE dept_sal > avsal
 ORDER BY dept_name;

Sort  (cost=411.76..411.77 rows=4 width=165)
  Sort Key: a.dept_name
  ->  Subquery Scan on a  (cost=411.27..411.72 rows=4 width=165)
        Filter: (a.dept_sal > a.avsal)
        ->  WindowAgg  (cost=411.27..411.57 rows=12 width=165)
              ->  HashAggregate  (cost=411.27..411.42 rows=12 width=133)
                    Group Key: d.dept_name
                    ->  Hash Join  (cost=1.27..361.27 rows=10000 width=105)
                          Hash Cond: (e.deptno = (d.deptno)::numeric)
                          ->  Seq Scan on employee e  (cost=0.00..210.00 rows=10000 width=9)
                          ->  Hash  (cost=1.12..1.12 rows=12 width=105)
                                ->  Seq Scan on dept d  (cost=0.00..1.12 rows=12 width=105)

Note that the optimizer's estimation of the cost dropped a little (412.15 -> 411.77).

Early grouping may reduce the query processing cost by reducing the amount of data participating in joins. In the execution plan above we can see that the 12 rows from the dept table and the 10,000 rows from the employee table participate in the join. On the other hand, if we rewrite the query as shown below, the 12 rows from the dept table are joined to only the 6 rows of the intermediate result set built from the employee table.

Here is the query I rewrote to decrease the amount of data participating in joins.

SELECT d.dept_name, e.dept_sal, e.avsal
  FROM dept d,
       (SELECT *
          FROM (
                SELECT deptno, SUM(sal) AS dept_sal, AVG(SUM(sal)) OVER () AS avsal
                  FROM public.employee
                 GROUP BY deptno
               ) k
         WHERE dept_sal > avsal
       ) e
 WHERE d.deptno = e.deptno
 ORDER BY d.dept_name;

Sort  (cost=261.71..261.72 rows=4 width=165)
  Sort Key: d.dept_name
  ->  Hash Join  (cost=260.50..261.67 rows=4 width=165)
        Hash Cond: ((d.deptno)::numeric = k.deptno)
        ->  Seq Scan on dept d  (cost=0.00..1.12 rows=12 width=105)
        ->  Hash  (cost=260.45..260.45 rows=4 width=69)
              ->  Subquery Scan on k  (cost=260.00..260.45 rows=4 width=69)
                    Filter: (k.dept_sal > k.avsal)
                    ->  WindowAgg  (cost=260.00..260.30 rows=12 width=69)
                          ->  HashAggregate  (cost=260.00..260.15 rows=12 width=37)
                                Group Key: employee.deptno
                                ->  Seq Scan on employee  (cost=0.00..210.00 rows=10000 width=9)

Observing the plan output, we can see that the estimated cost was reduced from 411.77 to 261.72.

Obviously, this doesn't mean much on a 10,000-row table; the difference will be of real interest with huge tables.
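
Planner costs are only estimates. If you want to see whether the rewrite actually pays off on your own data, you can run both versions with EXPLAIN (ANALYZE, BUFFERS); the sketch below simply wraps the rewritten query from above.

EXPLAIN (ANALYZE, BUFFERS)               -- shows actual run time and buffer reads, not just estimated cost
SELECT d.dept_name, e.dept_sal, e.avsal
  FROM dept d,
       (SELECT *
          FROM (
                SELECT deptno, SUM(sal) AS dept_sal, AVG(SUM(sal)) OVER () AS avsal
                  FROM public.employee
                 GROUP BY deptno
               ) k
         WHERE dept_sal > avsal
       ) e
 WHERE d.deptno = e.deptno
 ORDER BY d.dept_name;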

The last query may be less readable than the version using the SQL WITH clause, but it is worth rewriting the query to help the database server work efficiently.


Summary

Let me recap what I have discussed so far.

Although all of the queries above return identical results, PostgreSQL clearly performs better with the set-based approach than with the procedural one.


Footnote

I am not saying that subquery factoring always produces a suboptimal execution plan. Sometimes I use the SQL WITH clause to force the join order or to materialize intermediate results.
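
For completeness, on PostgreSQL 12 or later you can spell that intention out explicitly: the MATERIALIZED keyword forces the CTE to be computed once and keeps it as an optimization fence. A minimal sketch, reusing the dept_sal CTE from above; the 50000 threshold is an arbitrary figure, purely for illustration.

WITH dept_sal AS MATERIALIZED (   -- PG 12+: compute once and keep as an optimization fence
   SELECT dept_name, SUM(sal) AS dept_sal
     FROM employee e, dept d
    WHERE e.deptno = d.deptno
    GROUP BY dept_name)
SELECT *
  FROM dept_sal
 WHERE dept_sal > 50000           -- arbitrary threshold for illustration only
 ORDER BY dept_name;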


Added on Jan 28, 2024
I have found another wonderful YouTube video uploaded by Connor McDonald.
The video demonstrates the absurdity of writing application code that processes data row by row:
For faster database code .... just follow the real world - YouTube

Here are the client-side settings presented in the video to optimize row-by-row data processing.

SQL*Plus
set arraysize 100

Pro*C
$ proc prefetch=100 ...

ODP.NET
reader.FetchSize = cmd.RowSize * 100;

JDBC
conn.setDefaultRowPrefetch(100);


These settings achieve the optimization by prefetching multiple rows in a single database call, thus reducing network overhead and improving performance.
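
Since this is a Postgres board, it is worth noting the closest psql counterpart: the FETCH_COUNT variable, which makes psql retrieve result rows through a cursor in batches of the given size instead of collecting the whole result set at once.

psql
\set FETCH_COUNT 100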


   

postgresdba.com