설문조사
PostgreSQL/PPAS 관련 듣고 싶은 교육은


총 게시물 169건, 최근 0 건
   

to_char vs. date_trunc

글쓴이 : 모델광 날짜 : 2023-05-27 (토) 14:34 조회 : 1126
A few weeks ago, I was troubleshooting some PostgreSQL performance issue for a client. The client was migrating the database from Oracle to PostgreSQL.
A query in question had the following structure - I will post the code to create the EMPLOYEE table at the end of this note:


SELECT TO_CHAR(HIREDATE, 'YYYY'), COUNT(*)
   FROM EMPLOYEE
GROUP BY TO_CHAR(HIREDATE,'YYYY');

​Here is the execution plan I obtained by running the EXPLAIN command.

HashAggregate  (cost=285.00..410.00 rows=10000 width=40)
                                    (actual time=4.128..4.167 rows=28 loops=1)
  Group Key: to_char(hiredate, 'YYYY'::text)
  Batches: 1  Memory Usage: 409kB
  Buffers: shared hit=110
  ->  Seq Scan on employee  (cost=0.00..235.00 rows=10000 width=32)
                                                      (actual time=0.011..2.755 rows=10000 loops=1)
        Buffers: shared hit=110
Planning Time: 0.046 ms
Execution Time: 4.209 ms

Can you notice any inefficiencies in the plan? I couldn't. So I checked how the plan was in Oracle. However, the Oracle plan was the same as the PostgreSQL plan, and the elapsed time was also almost the same. I thought the client was rasing the bar too high after the migration.
In the end, I reported to the client that there was no way to improve the performance of the query. (Although we could make the HIREDATE column not null and create an index on it, that option was not available.)

But, a few weeks later the same developer who had asked me to investigate the query explored an alternative fuction for converting the date value of the TIMESTAMP data type. Eventually, he decided to use DATE_TRUNC.
Here is the query he modified to make the query go faster followed by its execution plan:

SELECT TO_CHAR(DATE_TRUNC('YEAR', HIREDATE),'YYYY'), COUNT(*)
   FROM EMPLOYEE
GROUP BY DATE_TRUNC('YEAR', HIREDATE);

HashAggregate  (cost=285.00..460.00 rows=10000 width=48)
                                     (actual time=2.878..2.922 rows=28 loops=1)
  Group Key: date_trunc('YEAR'::text, hiredate)
  Batches: 1  Memory Usage: 409kB
  Buffers: shared hit=110
  ->  Seq Scan on employee  (cost=0.00..235.00 rows=10000 width=16)
                                                       (actual time=0.009..1.826 rows=10000 loops=1)
        Buffers: shared hit=110
Planning Time: 0.047 ms
Execution Time: 2.967 ms

Both plans of the original query and the rewritten query utilize a HashAggregate operation to perform the grouping, which creates a hash table to group the rows based on the specified key (to_char(hiredate,'YYYY') or date_trunc('year', hiredate)). The Seq Scan operation scans the entire EMPLOYEE table to retrive the rows.
Even though the estimated cost is higher in the rewritten query, the elapsed time decreased from 4.2 to 2.9 ms. Note that the original query took 2.755 mili seconds to perform the Seq Scan, while the modified query only took 1.826 mili seconds. In the first query to_char(hiredate,'yyyy') was executed 10,000 times, whereas date_trunc('year', hiredate) was executed 10,000 times in the second query.

Comparing the two queries, the DATE_TRUNC function is more efficient for grouping by year because it operates directly on the date data type and performs a simple truncation without any string manipulation. On the other hand, the TO_CHAR function requires converting the date to a string  using a specific format, which incurs addtional overhead.

In general, performing operations directly on the native data types, like using date functions, is more efficient than converting data types and performing string manipulations. Therefore, the query with 'DATE_TRUNC' is likely faster than the query with 'TO_CHAR' for grouping by year.

Conclusion
When performing aggregations on time data, we should avoid unnecessary data type conversions.

Source code
If you want to do further experiments, here is the script I used to generate some test data:


CREATE TABLE employee (
    empno numeric(5,0) NOT NULL,
    ename character varying(10),
    job character varying(9),
    mgr numeric(5,0),
    hiredate timestamp(0),
    sal numeric(7,2),
    comm numeric(7,2),
    deptno numeric(2,0),
    sido_nm character varying(100)
);

insert into employee
select i, chr(65+mod(i,26))||i::text||'NM'
      ,case when mod(i,10000)=0 then 'PRESIDENT'
            when mod(i,1000) = 0 then 'MANAGER'
            when mod(i,3)=0 then 'SALESMAN'
            when mod(i,3)=1 then 'ANALYST'
            when mod(i,3)=2 then 'CLERK'
        end as job
      ,case when mod(i,10000)= 0 then null
            when mod(i,1000)= 1 then 10000
            when i >= 9000 then 1000
            else ceiling((i+1000)/1000)*1000
        end as mgr
      , current_date - i
      , trunc(random() * 10000) as sal
      , trunc(random() * 10000) as com
      , mod(i,12)+1            as deptno
      , case when mod(i,3) = 0 then 'Jeonbuk'
            when mod(i,3) = 1 then 'Kangwon'
            else                  'Chungnam'
        end as sido_nm
from generate_series(1,10000) a(i);

ALTER TABLE employee ADD CONSTRAINT employee_pk
  PRIMARY KEY (empno);

   

postgresdba.com