KEEP DENSE_RANK in PostgreSQL

Author : 모델광    Date : 2021-06-09 (Wed) 09:53    Views : 6074

Last year I was involved in a project where we had to migrate the DBMS from Oracle to PostgreSQL. Despite the existence of SQL standards, some Oracle SQL code required at least some changes before it could be ported to PostgreSQL.

After being ported, some SQL statements suffered a degradation in performance. The following is one of those examples, showing how we rewrote the Oracle code while maintaining its performance in PostgreSQL.


Below is the SQL statement a developer converted from Oracle code.


SELECT EMPLOYEE_ID, SALE_ID, SALE_DATE, EUR_VALUE, PRODUCT_ID

  FROM SALES A

 WHERE (EMPLOYEE_ID, SALE_DATE) IN (SELECT EMPLOYEE_ID, MAX(SALE_DATE)

                                      FROM SALES

                                     GROUP BY EMPLOYEE_ID)

ORDER BY EMPLOYEE_ID;


The developer complained that the elapsed time of the above query increased in PostgreSQL. 

My first thoughts about the SQL statement above were:

1. I may be able to improve the performance by removing the "IN" subquery.

2. Is there a way to fetch the resulting rows without visiting the SALES table twice?


In this note, I will show you how you can improve the query performance.


Below is the script I created to do some experiments.


CREATE TABLE employees (
  employee_id   NUMERIC       NOT NULL,
  first_name    VARCHAR(1000) NOT NULL,
  last_name     VARCHAR(1000) NOT NULL,
  date_of_birth DATE,
  phone_number  VARCHAR(1000) NOT NULL,
  junk          CHAR(1000),
  CONSTRAINT employees_pk PRIMARY KEY (employee_id)
);


CREATE FUNCTION random_string(minlen NUMERIC, maxlen NUMERIC)
RETURNS VARCHAR(1000) AS $$
DECLARE
  rv  VARCHAR(1000) := '';
  i   INTEGER := 0;
  len INTEGER := 0;
BEGIN
  IF maxlen < 1 OR minlen < 1 OR maxlen < minlen THEN
    RETURN rv;
  END IF;
  len := floor(random()*(maxlen-minlen)) + minlen;
  FOR i IN 1..floor(len) LOOP
    rv := rv || chr(97 + CAST(random() * 25 AS INTEGER));
  END LOOP;
  RETURN rv;
END;
$$ LANGUAGE plpgsql;


INSERT INTO employees (employee_id, first_name, last_name, date_of_birth, phone_number, junk)
SELECT GENERATE_SERIES
     , initcap(lower(random_string(2, 8)))
     , initcap(lower(random_string(2, 8)))
     , CURRENT_DATE - CAST(floor(random() * 365 * 10 + 40 * 365) AS NUMERIC) * INTERVAL '1 DAY'
     , CAST(floor(random() * 9000 + 1000) AS NUMERIC)
     , 'junk'
  FROM GENERATE_SERIES(1, 1000);


DROP TABLE sales;

CREATE TABLE sales (
  sale_id       NUMERIC       NOT NULL,
  employee_id   NUMERIC       NOT NULL,
  subsidiary_id NUMERIC       NOT NULL,
  sale_date     TIMESTAMP     NOT NULL,
  eur_value     NUMERIC(17,2) NOT NULL,
  product_id    BIGINT        NOT NULL,
  quantity      INTEGER       NOT NULL,
  channel       VARCHAR(4)    NOT NULL,
  CONSTRAINT sales_pk PRIMARY KEY (sale_id)
);

SELECT SETSEED(0);

INSERT INTO sales (sale_id, subsidiary_id, employee_id, sale_date, eur_value, product_id, quantity, channel)
SELECT row_number() OVER (), data.*
  FROM (SELECT gen % 100
             , e.employee_id
             , (CURRENT_DATE - CAST(RANDOM()*3650 AS NUMERIC) * INTERVAL '1 DAY') sale_date
             , CAST(RANDOM()*100000 AS NUMERIC)/100 eur_value
             , CAST(RANDOM()*25 AS NUMERIC) + 1 product_id
             , CAST(RANDOM()*5 AS NUMERIC) + 1 quantity
             , CASE WHEN gen % 2 = 0 THEN 'ONLI' ELSE 'OFFL' END
          FROM employees e
             , GENERATE_SERIES(1, 18000) gen
         WHERE MOD(employee_id, 7) = 4
         ORDER BY sale_date
       ) data
 WHERE TO_CHAR(sale_date, 'D') <> '1';

ANALYZE VERBOSE sales;

SELECT * FROM sales;

CREATE INDEX sales_x02 ON sales (employee_id);


From the converted query above we can infer that the business requirement is:

Extract the most recent data for each employee_id.

To double-check the business requirement, I asked for the original query in Oracle.

The following is the SQL code in Oracle.


SELECT EMPLOYEE_ID

     , MAX(SALE_ID) KEEP (DENSE_RANK LAST ORDER BY SALE_DATE) AS MAX_SALE_ID

     , MAX(SALE_DATE) KEEP (DENSE_RANK LAST ORDER BY SALE_DATE) AS SALE_DATE

     , MAX(EUR_VALUE) KEEP (DENSE_RANK LAST ORDER BY SALE_DATE) AS MAX_VALUE

     , MAX(PRODUCT_ID) KEEP (DENSE_RANK LAST ORDER BY SALE_DATE) AS PRODUCT_ID

  FROM SALES

GROUP BY EMPLOYEE_ID

ORDER BY EMPLOYEE_ID;


The KEEP (DENSE_RANK LAST) clause is not standard SQL; it is an Oracle dialect. Among application developers on Oracle it is not a commonly used way of getting the most recent data for a specific column (EMPLOYEE_ID), but for advanced Oracle query tuners the KEEP (DENSE_RANK LAST) clause was a lifesaver. In the query above, Oracle first orders the rows by SALE_DATE (within each EMPLOYEE_ID) applying the dense_rank mechanism, that is to say, it keeps only the rows that have the highest (last) dense rank, and then takes the maximum over that subset of the data.
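To make the mechanism concrete, here is a tiny hypothetical illustration (the values are made up and are not from the test data):

-- Suppose EMPLOYEE_ID 4 has only these two sales:
--   SALE_ID  SALE_DATE            EUR_VALUE
--   10       2021-01-05 11:00:00  100
--   20       2021-03-09 14:30:00  250    <- latest SALE_DATE
-- Then MAX(SALE_ID)   KEEP (DENSE_RANK LAST ORDER BY SALE_DATE) returns 20
--  and MAX(EUR_VALUE) KEEP (DENSE_RANK LAST ORDER BY SALE_DATE) returns 250.
-- Every KEEP column is taken only from the row(s) with the latest SALE_DATE;
-- if two rows tie on that date, MAX() is applied over just the tied rows.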

PostgreSQL doesn't have a counterpart function, so we have to use standard SQL to implement the business requirement:

Extract the most recent data for each employee_id.


The first question to ask in response to this request is:

What if an employee_id has two or more rows with the same (most recent) sale_date?

Should we fetch just one row for that time, or should we fetch all the rows for that time? This is the kind of special case you have to think about the moment you start turning a natural-language request into a formal specification.


From the Oracle SQL code we can infer that we should report all of them if several sales for an employee_id happen simultaneously.
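As a quick sanity check (this query is not part of the original post, just a sketch), you can see whether such ties actually exist by counting the rows that share each employee's most recent sale_date:

SELECT S.EMPLOYEE_ID, S.SALE_DATE, COUNT(*) AS ROWS_AT_MAX
  FROM SALES S
  JOIN (SELECT EMPLOYEE_ID, MAX(SALE_DATE) AS MAX_DATE
          FROM SALES
         GROUP BY EMPLOYEE_ID) M
    ON M.EMPLOYEE_ID = S.EMPLOYEE_ID
   AND M.MAX_DATE    = S.SALE_DATE
 GROUP BY S.EMPLOYEE_ID, S.SALE_DATE
HAVING COUNT(*) > 1;   -- any rows returned here mean ties exist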


Implementation number 1:

Let's start with a list showing the most recent SALE_DATE for each EMPLOYEE_ID, and then report all the sales that we can identify using that list of (EMPLOYEE_ID, SALE_DATE) pairs. To do that I will start with a simple aggregate query and use the result it produces in an "IN" subquery:


SELECT EMPLOYEE_ID, SALE_ID, SALE_DATE, EUR_VALUE, PRODUCT_ID

  FROM SALES A

 WHERE (EMPLOYEE_ID, SALE_DATE) IN (SELECT EMPLOYEE_ID, MAX(SALE_DATE)

                                      FROM SALES

                                     GROUP BY EMPLOYEE_ID)

ORDER BY EMPLOYEE_ID;


The query above is what the PostgreSQL coder created. Below is the resulting execution plan.


Gather Merge (actual time=776.957..779.738 rows=143 loops=1)

   Workers Planned: 2

   Workers Launched: 2

   Buffers: shared hit=74388 read=25872

   ->  Sort (actual time=773.435..773.445 rows=48 loops=3)

         Sort Key: a.employee_id

         Sort Method: quicksort  Memory: 36kB

         Buffers: shared hit=74388 read=25872

         Worker 0:  Sort Method: quicksort  Memory: 28kB

         Worker 1:  Sort Method: quicksort  Memory: 30kB

         ->  Hash Join (actual time=773.351..773.408 rows=48 loops=3)

               Hash Cond: ((a.employee_id = sales.employee_id) AND (a.sale_date = (

               Buffers: shared hit=74374 read=25872

               ->  Parallel Seq Scan on sales a (actual time=0.029..44.464 rows=735698 loops=3)

                     Buffers: shared hit=16117 read=8936

               ->  Hash (actual time=634.248..634.254 rows=143 loops=3)

                     Buckets: 1024  Batches: 1  Memory Usage: 15kB

                     Buffers: shared hit=58223 read=16936

                     ->  HashAggregate (actual time=634.162..634.183 rows=143 loops=3)

                           Group Key: sales.employee_id

                           Batches: 1  Memory Usage: 48kB

                           Buffers: shared hit=58223 read=16936

                           Worker 0:  Batches: 1  Memory Usage: 48kB

                           Worker 1:  Batches: 1  Memory Usage: 48kB

                           ->  Seq Scan on sales (actual time=0.006..137.369 rows=2207094 loops=3)

                                 Buffers: shared hit=58223 read=16936

 Planning Time: 0.151 ms

 Execution Time: 779.951 ms


The first thing you will notice about the execution plan is that the "IN" subquery collapsed and the optimizer used a simple hash join to get the final result. This is something that will often happen with an "IN" subquery, and it brings us to Implementation 2.


Implementation number 2:

I take as gospel the claim that the optimizer cannot always handle subqueries efficiently, so I prefer inline views to "IN" subqueries. Surely there are cases where subqueries are more efficient than inline views, but those cases are rare. So I rewrote the query using an inline view.


SELECT *

  FROM SALES A

     , (SELECT EMPLOYEE_ID, MAX(SALE_DATE) AS MAX_DATE

          FROM SALES

        GROUP BY EMPLOYEE_ID) B

 WHERE A.EMPLOYEE_ID = B.EMPLOYEE_ID

   AND A.SALE_DATE = B.MAX_DATE

ORDER BY A.EMPLOYEE_ID;


 Gather Merge (actual time=768.409..771.284 rows=143 loops=1)

   Workers Planned: 2

   Workers Launched: 2

   Buffers: shared hit=70662 read=29598

   ->  Sort (actual time=763.881..763.886 rows=48 loops=3)

         Sort Key: a.employee_id

         Sort Method: quicksort  Memory: 28kB

         Buffers: shared hit=70662 read=29598

         Worker 0:  Sort Method: quicksort  Memory: 30kB

         Worker 1:  Sort Method: quicksort  Memory: 36kB

         ->  Hash Join (actual time=763.799..763.845 rows=48 loops=3)

               Hash Cond: ((a.employee_id = sales.employee_id) AND (a.sale_date = 

               Buffers: shared hit=70648 read=29598

               ->  Parallel Seq Scan on sales a (actual time=0.039..46.634 rows=735698 loops=3)

                     Buffers: shared hit=16154 read=8899

               ->  Hash (actual time=617.055..617.055 rows=143 loops=3)

                     Buckets: 1024  Batches: 1  Memory Usage: 15kB

                     Buffers: shared hit=54460 read=20699

                     ->  HashAggregate (actual time=617.000..617.017 rows=143 loops=3)

                           Group Key: sales.employee_id

                           Batches: 1  Memory Usage: 48kB

                           Buffers: shared hit=54460 read=20699

                           Worker 0:  Batches: 1  Memory Usage: 48kB

                           Worker 1:  Batches: 1  Memory Usage: 48kB

                           ->  Seq Scan on sales (actual time=0.008..139.083 rows=2207094 loops=3)

                                 Buffers: shared hit=54460 read=20699

 Planning Time: 0.249 ms

 Execution Time: 771.377 ms


You will notice, of course, the remarkable similarity between the previous execution plan and this one - in fact the plans are identical. The PostgreSQL optimizer was able to transform the previous query with the "IN" subquery into this query with the inline view, so the performance of the two queries is the same.


Implementation number 3:

We might have decided to check every tuple in the table to see if the sale date in that row was the most recent sale date for the employee_id in that row, which we could do by running a correlated subquery to do the check for every tuple in the table.


SELECT *

  FROM SALES A

 WHERE SALE_DATE = (SELECT MAX(SALE_DATE)

                      FROM SALES B

                     WHERE A.EMPLOYEE_ID = B.EMPLOYEE_ID)

ORDER BY A.EMPLOYEE_ID;


 Index Scan using sales_x02 on portal.sales a

   Output: a.sale_id, a.employee_id, a.subsidiary_id, a.sale_date, a.eur_value, a.pro ....

   Filter: (a.sale_date = (SubPlan 1))

   SubPlan 1

     ->  Aggregate

           Output: max(b.sale_date)

           ->  Bitmap Heap Scan on portal.sales b

                 Output: b.sale_id, b.employee_id, b.subsidiary_id, b.sale_date, .....

                 Recheck Cond: (a.employee_id = b.employee_id)

                 ->  Bitmap Index Scan on sales_x02

                       Index Cond: (b.employee_id = a.employee_id)


In the execution plan above I intentionally omitted the execution time and block I/Os. When I ran the query, I couldn't get the results even after waiting 120 seconds, so I canceled the query. Since there are 2,207,094 rows in the SALES table, we can infer that the subquery (SubPlan 1 in the plan above) would be executed 2,207,094 times; even at an optimistic 0.05 ms per execution, that alone would take roughly 110 seconds, which is consistent with the query not finishing within 120 seconds.

This test tells us that you should avoid writing correlated subqueries.

If you are not clear about what a correlated subquery is, please refer to this site:

https://en.wikipedia.org/wiki/Correlated_subquery


Implementation number 4:

A conspicuous defect in the three plans so far is that we have to visit the SALES table twice. One of the principles I take as gospel is:

Don't access the same table twice. Try to find a way to visit the table only once in order to fetch rows.

Is there a way we can avoid accessing the SALES table twice? The answer is yes. PostgreSQL offers us various analytic functions.


SELECT EMPLOYEE_ID, SALE_ID, SALE_DATE, EUR_VALUE, PRODUCT_ID

  FROM (

       SELECT EMPLOYEE_ID

            , SALE_ID

            , SALE_DATE

            , EUR_VALUE

            , PRODUCT_ID

            , MAX(SALE_DATE) OVER (PARTITION BY EMPLOYEE_ID) MAX_DATE

        FROM SALES

        ) A

 WHERE A.SALE_DATE = A.MAX_DATE

ORDER BY A.EMPLOYEE_ID;


 Subquery Scan on a (actual time=26.266..3384.268 rows=143 loops=1)

   Filter: (a.sale_date = a.max_date)

   Rows Removed by Filter: 2206951

   Buffers: shared hit=760536 read=897450

   ->  WindowAgg (actual time=24.097..3304.210 rows=2207094 loops=1)

         Buffers: shared hit=760536 read=897450

         ->  Index Scan using sales_x02 on sales (actual time=0.071..2444.642 rows=2207094 loops=1)

               Buffers: shared hit=760536 read=897450

 Planning Time: 0.075 ms

 Execution Time: 3384.413 ms


By adding the analytic max() function I can acquire the necessary data once and post-process it to find the max(sale_date) for each employee_id before discarding all the rows where the row's sale date doesn't match the maximum sale date. The expression "max(sale_date) over (partition by employee_id)" tells PostgreSQL to partition the data by employee_id and find the maximum sale date within the employee_id.

So we have managed to produce the same result with a single Index Scan instead of visiting the SALES table twice as in every other plan. But there is a big defect here: to be able to partition by employee_id, PostgreSQL has had to fetch every row and column we are interested in and aggregate the data before deriving values for the new column. Consequently, the Index Scan the planner chose to do was a bad idea. We can see that the Index Scan alone took 2,444 ms.
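One way to test that claim (this experiment is not from the original post, just a sketch) is to discourage index access for the session and re-run the query, then compare the plans. enable_indexscan and enable_bitmapscan are standard planner parameters; the outcome will of course depend on your own data and hardware.

SET enable_indexscan = off;
SET enable_bitmapscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT EMPLOYEE_ID, SALE_ID, SALE_DATE, EUR_VALUE, PRODUCT_ID
  FROM (SELECT EMPLOYEE_ID, SALE_ID, SALE_DATE, EUR_VALUE, PRODUCT_ID
             , MAX(SALE_DATE) OVER (PARTITION BY EMPLOYEE_ID) MAX_DATE
          FROM SALES) A
 WHERE A.SALE_DATE = A.MAX_DATE
 ORDER BY A.EMPLOYEE_ID;

RESET enable_indexscan;
RESET enable_bitmapscan;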

In this case the variation in actual execution time between the plans was dramatic. Compared to the elapsed times of Implementation 1 and Implementation 2, it appears that using an analytic function doesn't give us any benefit here.


Implementation number 5:

The resulting execution time above puzzled me, because I have always insisted and believed that reducing access to the table is the first thing to do when speeding up a query. So I tried another analytic function.


SELECT EMPLOYEE_ID, SALE_ID, SALE_DATE, EUR_VALUE, PRODUCT_ID

  FROM (SELECT EMPLOYEE_ID

            , SALE_ID

            , SALE_DATE

            , EUR_VALUE

            , PRODUCT_ID

            , ROW_NUMBER() OVER (PARTITION BY EMPLOYEE_ID ORDER BY SALE_DATE DESC) RN

          FROM SALES

        ) A

 WHERE RN = 1

ORDER BY EMPLOYEE_ID;



 Subquery Scan on a (actual time=893.177..2321.557 rows=143 loops=1)

  Filter: (a.rn = 1)

  Rows Removed by Filter: 2206951

  Buffers: shared hit=16240 read=8956, temp read=17573 written=17628

  ->  WindowAgg (actual time=893.173..2228.378 rows=2207094 loops=1)

        Buffers: shared hit=16240 read=8956, temp read=17573 written=17628

        ->  Gather Merge (actual time=893.126..1542.316 rows=2207094 loops=1)

              Workers Planned: 2

              Workers Launched: 2

              Buffers: shared hit=16240 read=8956, temp read=17573 written=17628

              ->  Sort (actual time=871.813..1026.559 rows=735698 loops=3)

                    Sort Key: sales.employee_id, sales.sale_date DESC

                    Sort Method: external merge  Disk: 36184kB

                    Buffers: shared hit=16240 read=8956, temp read=17573 written=17628

                    Worker 0:  Sort Method: external merge  Disk: 35864kB

                    Worker 1:  Sort Method: external merge  Disk: 36104kB

                    ->  Parallel Seq Scan on sales (actual time=0.012..89.533 rows=735698 loops=3)

                          Buffers: shared hit=16104 read=8949

 Planning:

  Buffers: shared hit=2 read=3

 Planning Time: 0.192 ms

 Execution Time: 2326.598 ms



By using the analytic row_number() function, I don't have to visit the SALES table twice.

Let's take a look at the difference between this execution plan and the previous one. This time the planner chose to do a Parallel Seq Scan instead of an Index Scan, which is why this query was faster than the previous one. Because the optimizer didn't access the index, it had to sort the data by (employee_id, sale_date DESC). Scanning the SALES table took only 89 ms, but that advantage was partly cancelled out by the sort operation: three processes (two workers and the leader) each sorted about 735,698 rows, and the sorts spilled to disk.
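One thing worth trying (my own hedged suggestion, not something tested in the original post) is to give the session more sort memory: each worker spilled roughly 36 MB to disk, so a larger work_mem might keep the sort in memory. The value below is only a guess, and work_mem applies per sort per worker, so size it carefully on a shared server.

SET work_mem = '128MB';
-- re-run the ROW_NUMBER() query above and compare the "Sort Method" lines
-- in the plan (quicksort in memory vs. external merge on disk)
RESET work_mem;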

I have no idea why the planner chooses to do a Seq Scan when we use the row_number() function and an Index Scan when we use the max() function. I would appreciate it if someone could explain the mechanism.

Anyway, to my disappointment, we cannot get a significant improvement in performance by using an analytic function.



Implementation number 6:

I may be too stubborn in sticking to the principle "Reduce access to the table," but below is the SQL statement I re-engineered.



SELECT EMPLOYEE_ID

    , (MAX(ARRAY[SALE_DATE::TEXT,SALE_ID::TEXT,EUR_VALUE::TEXT, PRODUCT_ID::TEXT]))[1]

    , (MAX(ARRAY[SALE_DATE::TEXT,SALE_ID::TEXT,EUR_VALUE::TEXT, PRODUCT_ID::TEXT]))[2]

    , (MAX(ARRAY[SALE_DATE::TEXT,SALE_ID::TEXT,EUR_VALUE::TEXT, PRODUCT_ID::TEXT]))[3]

    , (MAX(ARRAY[SALE_DATE::TEXT,SALE_ID::TEXT,EUR_VALUE::TEXT, PRODUCT_ID::TEXT]))[4]

  FROM SALES

GROUP BY EMPLOYEE_ID

ORDER BY EMPLOYEE_ID;



 Finalize GroupAggregate (actual time=572.208..574.427 rows=143 loops=1)

  Group Key: employee_id

  Buffers: shared hit=16370 read=8881

  ->  Gather Merge (actual time=572.196..574.283 rows=429 loops=1)

        Workers Planned: 2

        Workers Launched: 2

        Buffers: shared hit=16370 read=8881

        ->  Sort (actual time=567.070..567.077 rows=143 loops=3)

              Sort Key: employee_id

              Sort Method: quicksort  Memory: 45kB

              Buffers: shared hit=16370 read=8881

              Worker 0:  Sort Method: quicksort  Memory: 45kB

              Worker 1:  Sort Method: quicksort  Memory: 45kB

              ->  Partial HashAggregate (actual time=566.995..567.011 rows=143 loops=3)

                    Group Key: employee_id

                    Batches: 1  Memory Usage: 64kB

                    Buffers: shared hit=16356 read=8881

                    Worker 0:  Batches: 1  Memory Usage: 64kB

                    Worker 1:  Batches: 1  Memory Usage: 64kB

                    ->  Parallel Seq Scan on sales (actual time=0.045..55.801 rows=735698 loops=3)

                          Buffers: shared hit=16172 read=8881

 Planning Time: 0.115 ms

 Execution Time: 574.553 ms



Wow, I was impressed by the resulting execution plan. Firstly, PostgreSQL scanned the SALES table in parallel: three processes (two worker processes and one leader process) scanned the table and hash-aggregated the rows by EMPLOYEE_ID. Unlike the previous execution plan, there are no disk spills. After the aggregation there are only 143 rows, and the three processes sorted those intermediate rows by EMPLOYEE_ID. Sorting 143 rows has no impact on performance; it took only 0.066 ms (567.077 - 567.011).

Conceptually, the mechanism behind this execution plan is the same as that of Oracle's KEEP (DENSE_RANK LAST). For any EMPLOYEE_ID, the expression (MAX(ARRAY[SALE_DATE::TEXT,SALE_ID::TEXT,EUR_VALUE::TEXT,PRODUCT_ID::TEXT]))[2] will be the SALE_ID of the row that has the maximum SALE_DATE for that employee and, similarly, elements [3] and [4] will also be values from that same row.
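This works because PostgreSQL compares arrays element by element, so the TEXT-cast SALE_DATE in the first position decides which array MAX() keeps, and the default timestamp text format (YYYY-MM-DD HH24:MI:SS) sorts chronologically even when compared as text. A minimal sketch of the comparison rule (with made-up values, not from the test data):

SELECT ARRAY['2021-06-01 09:00:00', '3'   ]::TEXT[]
     > ARRAY['2021-05-01 10:00:00', '1700']::TEXT[];   -- true: decided by the first element

-- One caveat: if two rows tie on SALE_DATE, the remaining elements are also
-- compared as text, so '9' sorts after '10'; with ties present, the values picked
-- may differ from what Oracle's numeric MAX(...) KEEP (...) would return.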



Conclusion

When you are given a business requirement, there are various ways of meeting it, so get to know several ways of implementing the same requirement.

In the test case above, the last query was the best strategy. But remember the absolutely critical point here: depending on how the table is indexed, how the data is stored, or how the memory parameters are configured, the best strategy can change.

Advanced query tuners may be able to come up with the best SQL statement the moment they get the business requirement, but until we become such skillful tuners, we have to be able to try various strategies.