DATE vs. VARCHAR(8)

Author : 모델광   Date : 2023-02-12 (Sun) 20:05   Views : 1077

Last year, I worked as a data modeler for a project. When I joined the project, one of the data architecture principles was not to use VARCHAR2(8) for date columns, but rather to use the DATE data type.

The project utilized Oracle as its DBMS. The DATE data type in Oracle encompasses the date, hour, minute and second, while the DATE data type in PostgreSQL includes only the date. But we can apply the same DA principle in PostgreSQL.
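A quick check illustrates the difference; casting a timestamp to DATE in PostgreSQL simply discards the time portion:

SELECT '2023-02-12 20:05:00'::timestamp::date;  -- returns 2023-02-12; the time part is gone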

When the principle was announced by the DA team manager, some modelers opposed it, saying "Many developers are familiar with VARCHAR2(8) for date columns. If we use DATE for date columns, they will have to spend more time coding." As you may guess, those arguments were ignored by the DA team manager. And I was happy with the principle because I had frequently come across performance degradation when the VARCHAR2(8) data type was used.

In this note I will provide you with one of the reasons why we should use the DATE data type for date columns.


I will start by building a sample table with both a DATE data type column and a VARCHAR(8) data type column.


create table offline_order (
ord_no numeric(10,0) not null,
cust_no numeric not null,
ord_date date not null,
ord_dt varchar(8) not null,
ord_status_cd varchar(1) not null,
comment varchar(100)
);

I have generated some data for the sample table. Note that I've inserted the same date values into both the ORD_DATE and ORD_DT columns.

insert into offline_order
select i, mod(i,1000000) as cust_no
,current_date - mod(i,1000) as ord_date
,to_char((current_date - mod(i,1000)),'yyyymmdd') as ord_dt
,(mod(i,4) + 1) as ord_status_cd
,lpad('x',100,'x')
from generate_series(1,2000000,2) a(i);

alter table offline_order add constraint offline_order_pk
primary key (ord_no);

CREATE INDEX OFFLINE_ORDER_X01 ON OFFLINE_ORDER(CUST_NO);

I have run the ANALYZE command to gather statistics for the table.

ANALYZE OFFLINE_ORDER;

Now I have run two queries to show you how the optimizer calculates the estimated row count. I have included the execution plans for each query. The results from this demonstration come from PostgreSQL 15.1.


Query 1. VARCHAR(8) data type

SELECT *
FROM OFFLINE_ORDER
WHERE ORD_DT BETWEEN '20211231' AND '20220102';

Gather (cost=1000.00..28820.80 rows=11738 width=128)
(actual time=0.341..61.820 rows=4000 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=15983 read=4414
-> Parallel Seq Scan on offline_order (cost=0.00..26647.00 rows=4891 width=128) (actual time=0.045..57.304 rows=1333 loops=3)
Filter: (((ord_dt)::text >= '20211231'::text) AND ((ord_dt)::text <= '20220102'::text))
Rows Removed by Filter: 332000
Buffers: shared hit=15983 read=4414
Planning Time: 0.092 ms
Execution Time: 62.188 ms

Query 2. DATE data type


SELECT *
FROM OFFLINE_ORDER
WHERE ORD_DATE BETWEEN '20211231'::DATE AND '20220102'::DATE;


Gather (cost=1000.00..28041.00 rows=3940 width=128)
(actual time=0.374..46.174 rows=4000 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=15990 read=4407
-> Parallel Seq Scan on offline_order (cost=0.00..26647.00 rows=1642 width=128) (actual time=0.037..41.670 rows=1333 loops=3)
Filter: ((ord_date >= '2021-12-31'::date) AND (ord_date <= '2022-01-02'::date))
Rows Removed by Filter: 332000
Buffers: shared hit=15990 read=4407
Planning:
Buffers: shared hit=9 read=2
Planning Time: 0.109 ms
Execution Time: 46.418 ms


There are two things that stand out in the two execution plans.

1) PostgreSQL thinks Query 1 (cost 28820.80) is more expensive than Query 2 (cost 28041.00).
2) Accordingly, Query 2 (46.418 ms) runs faster than Query 1 (62.188 ms).


Keep your eyes on the estimated number of rows. In Query 1, the optimizer estimated that it would retrieve 11738 rows, whereas in Query 2 it expected to retrieve 3940 rows. Compared to the actual number of rows returned (4000), the estimate in the second query is far more accurate.

It is important to note that the join method the optimizer selects, either a hash join or a nested loop join, greatly affects query performance and is driven by the estimated row counts. The most common cause of poor execution plans despite seemingly "good" statistics is inaccurate row count estimates. Many DBMS gurus have emphasized the importance of good cardinality estimates. To quote the Oracle expert Wolfgang Breitling: "If an access plan is not optimal it is because the cardinality estimate for one or more of the row sources is grossly incorrect."
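If you want to observe this effect with the sample table, one way (my own sketch, not part of the original demo) is to join it to a hypothetical CUSTOMER lookup table and compare the join methods the planner picks under the two predicates; whether the plan actually flips between a nested loop and a hash join will depend on your environment.

-- A hypothetical lookup table, created only for this illustration.
create table customer as
select i as cust_no, 'name_' || i as cust_name
from generate_series(1,1000000) a(i);

alter table customer add constraint customer_pk primary key (cust_no);
analyze customer;

-- Compare the join methods chosen under the inflated VARCHAR estimate
-- and the accurate DATE estimate.
explain
select o.ord_no, c.cust_name
from offline_order o
join customer c on c.cust_no = o.cust_no
where o.ord_dt between '20211231' and '20220102';

explain
select o.ord_no, c.cust_name
from offline_order o
join customer c on c.cust_no = o.cust_no
where o.ord_date between '20211231'::date and '20220102'::date;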

Using the VARCHAR(8) data type for date columns throws the PostgreSQL optimizer into confusion when it estimates the cardinality. You may be frustrated with the optimizer for misestimating the cardinality in Query 1, but this is not the fault of the optimizer; it is the fault of the designer who chose VARCHAR(8) for the date column. By the end of this article, you will understand why the incorrect estimate is not the optimizer's fault.

I will now show you how PostgreSQL arrived at the estimated row counts of 11738 and 3940, respectively.
I have queried the pg_stats system view to look at the statistics for the table.


SELECT attname, n_distinct, most_common_vals, most_common_freqs, histogram_bounds
FROM PG_STATS
WHERE TABLENAME='offline_order';

attname        n_distinct  most_common_vals     most_common_freqs        histogram_bounds
-------------  ----------  -------------------  -----------------------  ----------------------------
ord_no         -1.0        NULL                 NULL                     {83,21123,...,1999931}
ord_status_cd  2.0         {2,4}                {0.50413334,0.49586666}
comment        1.0         {xxxxx.........xxx}  {1.0}
cust_no        -0.502836   NULL                 NULL                     {1,9247,20045,.....,999999}
ord_date       500.0       {2022-02-11,....}    {0.0027333333....}       {2020-05-18,....,2023-02-10}
ord_dt         500.0       {20220211,....}      {0.0027333333....}       {20200518,.....,20230210}


Note that both the ORD_DATE and ORD_DT columns have 500 distinct values, and both have entries in most_common_vals and histogram_bounds, which means each column has a most-common-values (MCV) list and an equi-height histogram. In this case, the statistics for both ORD_DATE and ORD_DT hold 9 most common values and 100 histogram buckets.
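You can check those counts yourself in pg_stats; the text casts below are only a workaround for the anyarray type of histogram_bounds, and note that 100 buckets are delimited by 101 boundary values:

SELECT attname,
       cardinality(most_common_freqs) AS mcv_count,
       array_length(histogram_bounds::text::text[], 1) AS boundary_count
FROM PG_STATS
WHERE TABLENAME = 'offline_order'
AND ATTNAME IN ('ord_date', 'ord_dt');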

The optimizer takes the following steps to calculate the number of rows:

1. Calculate the ratio of the most common values

with w as (
SELECT UNNEST(MOST_COMMON_FREQS)
FROM PG_STATS
WHERE TABLENAME='offline_order'
AND ATTNAME = 'ord_dt'
)
select sum(unnest) from w;

sum
----------
0.02406666

2. Calculate the number of rows holding non-most-common values (non-MCVs)

= (1 - 0.02406666) * 1000000 = 975933

3. Calculate the cardinality of one non-MCV
= the number of non-MCV rows
/ (the number of distinct values - the number of MCVs)
= 975933 / (500 - 9) = 1987.64
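The three steps can be rolled into a single query against pg_stats; the literal 1000000 below is the table's row count, which you could also read from pg_class.reltuples:

SELECT (1 - sum(f)) * 1000000 / (max(n_distinct) - count(*)) AS rows_per_non_mcv
FROM PG_STATS, UNNEST(MOST_COMMON_FREQS) AS f
WHERE TABLENAME = 'offline_order'
AND ATTNAME = 'ord_dt';
-- returns roughly 1987.64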

Therefore, when you run the following queries, whose WHERE clause conditions do not reference a most common value, the optimizer estimates that each will return 1988 rows.

SELECT *
FROM OFFLINE_ORDER
WHERE ORD_DT = '20220101'; -- The data type is VARCHAR(8)

SELECT *
FROM OFFLINE_ORDER
WHERE ORD_DATE = '20220101'::date; -- The data type is DATE.

Gather (cost=1000.00..26804.13 rows=1988 width=128)
Workers Planned: 2
-> Parallel Seq Scan on offline_order (cost=0.00..25605.33 rows=828 width=128)
Filter: (ord_date = '2022-01-01'::date)

Now let's see how the optimizer estimates the number of rows in Query 1, which uses the VARCHAR(8) data type.

Firstly, it uses the formula introduced above to estimate the number of rows for the start value of the range, resulting in 1988. Secondly, it subtracts the start value of the range (20211231) from the end value of the range (20220102), treating both strings as numbers, to get 8871. Lastly, it adds these two values together (1988 + 8871) to get 10859. Even though the figure I calculated differs from the number in the execution plan, this is essentially how the optimizer estimates the rows. Human beings know that there are only 3 distinct date values between '20211231' and '20220102'; the optimizer cannot know this, which is why it resorts to such an extraordinary formula.
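You can reproduce the middle step of that arithmetic directly, treating the string bounds as plain numbers:

SELECT 20220102 - 20211231 AS DIFF;

diff
----
8871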

Now let's take a look at how the optimizer does the row count estimation in Query 2, which uses the DATE data type. In this case, the optimizer knows that there are only 3 dates between '20211231'::date and '20220102'::date. It uses the following formula:


the number of rows at the start point of the range
+ ('20220102'::date - '20211231'::date) * the number of non-MCVs
  / (max(ORD_DATE) - min(ORD_DATE))
= 1988 + ((2 / 998) * 975933) = 3944

SELECT '20220102'::DATE - '20211231'::DATE AS DIFF;

diff
----
2



SELECT MAX(ORD_DATE), MIN(ORD_DATE), MAX(ORD_DATE)-MIN(ORD_DATE)
FROM OFFLINE_ORDER;

max         min         max-min
----------  ----------  -------
2023-02-10  2020-05-18  998
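Plugging the two measured quantities back into the formula reproduces the estimate:

SELECT 1988 + (2 / 998.0) * 975933 AS ESTIMATED_ROWS;  -- roughly 3944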


Even though the number 3944 does not match the number 3940 in the execution plan, this is how the optimizer does the estimation when there is a BETWEEN condition in the WHERE clause.

In conclusion, using the VARCHAR(8) data type for date columns limits the ability of the optimizer to accurately estimate the row count, leading to potential performance degradation. It is therefore recommended to use the DATE data type for date columns to ensure optimal performance.

FOOTNOTE
Here are other reasons why you should use a DATE data type for date columns:

1. Space Efficiency: A DATE data type takes up less space in PostgreSQL than a VARCHAR(8) column, as the DATE data type uses a binary representation to store the date, while a VARCHAR(8) column stores the date as a string of characters.

SELECT PG_COLUMN_SIZE('20230210'::DATE), PG_COLUMN_SIZE('20230210');

DATE | VARCHAR(8)
-----+-----------
   4 |          9

2. Ease of use: When using a DATE data type, you don't have to worry about converting between string and date formats. PostgreSQL provides built-in functions for operations such as adding or subtracting days from a date, or converting a date to a string representation.
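For example, all of the following work directly on a DATE value, with no manual string handling:

SELECT CURRENT_DATE + 7 AS A_WEEK_LATER,                  -- add days with integer arithmetic
       CURRENT_DATE - DATE '2023-01-01' AS DAYS_ELAPSED,  -- date subtraction yields an integer
       TO_CHAR(CURRENT_DATE, 'yyyymmdd') AS AS_STRING;    -- format as text only when needed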

3. Improved performance: PostgreSQL can perform date operations much faster on columns of type DATE than on columns of type VARCHAR(8), as the database is optimized to work with dates stored in a binary format.

4. Consistency: By using a consistent data type for date columns, you can help ensure that data is entered in a consistent format, which can reduce the likelihood of date errors.
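A DATE column also enforces this at the gate: it rejects impossible values, whereas a VARCHAR(8) column accepts any eight characters. The INSERT below is a hypothetical example, not part of the sample data:

-- The VARCHAR(8) column silently stores a date that does not exist.
INSERT INTO OFFLINE_ORDER (ORD_NO, CUST_NO, ORD_DATE, ORD_DT, ORD_STATUS_CD)
VALUES (2000001, 1, DATE '2023-02-28', '20230231', '1');

-- The DATE type raises an error instead:
SELECT '20230231'::DATE;  -- ERROR: date/time field value out of range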


FOOTNOTE 2
I owe much of this note to the books "PostgreSQL 9.6 Performance Story" and "Optimizing Oracle Optimizer".

ADDENDUM
If you take a close look at the estimated costs of Query 1 (28820.80) and Query 2 (28041.00), you will see that PostgreSQL believes Query 2 performs better. In fact, if you run both queries repeatedly, you will find that Query 2 is consistently faster. This suggests that the size of the column also plays a role in performance.


   
