PostgreSQL 은 특별합니다.
단순한 데이터베이스가 아니라, 하나의 어플리케이션 플랫폼이기 때문입니다.

공지사항

뉴스 및 기사

	총 게시물 184건, 최근 1 건

COUNT(*) vs. COUNT(COLUMN_NAME)

글쓴이 : 모델광 날짜 : 2023-02-19 (일) 15:58 조회 : 3976

I am currently working on a project that involves transforming Oracle SQL statements into equivalent PostgreSQL statements. During my work, I noticed a slight performance degradation after running q query with the COUNT(column_name) function in PostgreSQL.
This article is intended to provide you with some tips on how to use the COUNT function in PostgreSQL efficiently.
I am going to start with a simple script to create a test table and insert some data.

create table online_order (
ord_no numeric(10,0)      not null,
cust_no numeric          not null references customer,
ord_date timestamp(0)    not null,
ord_dt   varchar(8)        not null,
ord_status_cd varchar(1) not null,
comment varchar(100)
);

insert into online_order
select i, mod(i,1000000) as cust_no
      ,current_date - mod(i,1000) as ord_date
      ,to_char((current_date - mod(i,1000)),'yyyymmdd') as ord_dt
      ,(mod(i,4) + 1) as ord_status_cd
      ,lpad('x',100,'x')
from generate_series(1,2000000,2) a(i);

alter table online_order add constraint online_order_pk primary key (ord_no);

Then I am going to create an index on the column CUST_NO and ORD_STATUS_CD.

CREATE INDEX ONLINE_ORDER_X02 ON ONLINE_ORDER(CUST_NO, ORD_STATUS_CD);

Here is the query we need to port from Oracle to PostgreSQL.

SELECT ORD_STATUS_CD, COUNT(ORD_DATE)
  FROM ONLINE_ORDER
 WHERE CUST_NO IN (55,60)
GROUP BY ORD_STATUS_CD;

Here is the output produced by running the EXPLAIN command in PostgreSQL 15.1.

GroupAggregate (cost=166.86..167.20 rows=4 width=10) (actual time=0.088..0.092 rows=2 loops=1)
Group Key: ord_status_cd
Buffers: shared hit=27
-> Sort (cost=166.86..166.96 rows=40 width=10) (actual time=0.077..0.080 rows=40 loops=1)
        Sort Key: ord_status_cd
        Sort Method: quicksort Memory: 27kB
        Buffers: shared hit=27
        -> Bitmap Heap Scan on online_order (cost=9.17..165.80 rows=40 width=10) (actual time=0.032..0.062 rows=40 loops=1)
              Recheck Cond: (cust_no = ANY ('{55,60}'::integer[]))
              Heap Blocks: exact=21
              Buffers: shared hit=27
              -> Bitmap Index Scan on online_order_x02 (cost=0.00..9.16 rows=40 width=0) (actual time=0.021..0.022 rows=40 loops=1)
                    Index Cond: (cust_no = ANY ('{55,60}'::integer[]))
                    Buffers: shared hit=6
Planning Time: 0.124 ms
Execution Time: 0.128 ms

The following is a quick description of the plan:

1. The first node, GroupAggregate, calls the second node, Sort.
2. The Sort node calls the Bitmap Heap Scan node.
3. The Bitmap Heap Scan node calls the Bitmap Index Scan node asking for a row from the Bitmap Index Scan node.
4. Bitmap Index Scan scans the ONLINE_ORDER_X02 index as per the condition (cust_no = ANY ('{55,60}'::integer[])) almost in the same way as done in a normal Index Scan. Then it sends the retrieved bitmap to the Bitmap Heap Scan. You may wonder why the optimizer chose to access a bitmap index scan instead of a normal index scan. Complete implementaion details would make this post far longer than it already is, so refer to the following note for detailed explanation.

https://www.postgresdba.com/bbs/board.php?bo_table=C05&wr_id=185&page=3

5. Bitmap Heap Scan reads through the bitmap returned by Bitmap Index Scan in order to get heap data corresponding to stored page number and offset. After reading 21 (Heap Blocks: exact=21) table blocks, PostgreSQL checked the search condition (cust_no = ANY ('{55,60}'::integer[])) again. It finally got 40 rows.
6. The Sort node sorted the 40 rows by the ORD_STATUS_CD.
7. The GoupAggregate node aggregated the sorted 40 rows, resuling in 2 rows.

When I ran the query in PostgreSQL, it was always slower than in Oracle. It was mysterious to me because I could not find any inefficiency in the plan. So I checked the execution plan in Oracle.

Here is the execution plan in Oracle 12.2.

| Id | Operation | Name | Starts| A-Rows | A-Tim | Buffers | OMem | 1Mem | Used-Mem |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT| |   1 |    2 |00:00:00.01 |       6 |       |       |          |
|   1 | HASH GROUP BY    |   |    1 | 2 |00:00:00.01 |       6 | 1600K| 1600K| 489K (0)|
|   2 |   INLIST ITERATOR |     | 1 |     40 |00:00:00.01 |     6 |       |       |        |
|* 3 |    INDEX RANGE SCAN| ON..._X02 | 2 | 40 |00:00:00.01 | 6 | | |        |
----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------
3 - access(("CUST_NO"=55 OR "CUST_NO"=60))

Column Projection Information (identified by operation id):
-----------------------------------------------------------
1 - (rowset=256) "ORD_STATUS_CD"[VARCHAR2,1], COUNT(*)[22]
2 - "ORD_STATUS_CD"[VARCHAR2,1]

3 - "ORD_STATUS_CD"[VARCHAR2,1]

Wow! I was amazed by the plan. Note that the ONLINE_ORDER_X02 index does not have the ORD_DATE column. It consists of the CUST_NO and ORD_STATUS columns. But Oracle did not access the table to get the rows in the ORD_DATE. Instead, it simply counted the number of rows in the ONLINE_ORDER_X02 index. Logically it makes sense not to access the rows in the ORD_DATE column because the ORD_DATE column is not null. When we take a look at the Column Porjection Information section in the plan, we can see that Oracle transformed the original SQL statement as follows:

SELECT ORD_STATUS_CD, COUNT(*)
FROM ONLINE_ORDER
WHERE CUST_NO IN (55,60)
GROUP BY ORD_STATUS_CD;

So I ran the above query in PostgreSQL and I got the following execution plan:

HashAggregate (cost=13.76..13.79 rows=4 width=10) (actual time=0.044..0.045 rows=2 loops=1)
Group Key: ord_status_cd
Batches: 1 Memory Usage: 24kB
Buffers: shared hit=10
-> Index Only Scan using online_order_x02 on online_order (cost=0.43..13.56 rows=40 width=2) (actual time=0.021..0.030 rows=40 loops=1)
        Index Cond: (cust_no = ANY ('{55,60}'::integer[]))
        Heap Fetches: 0
        Buffers: shared hit=10
Planning Time: 0.119 ms
Execution Time: 0.071 ms

You will notice that PostgreSQL now has the same execution plan as Oracle, which resulted in lower number of block I/Os (27 -> 10) and faster elapsed time (0.128 -> 0.071).

Conclusion
When you count the number of rows in PostgreSQL, you should be more careful than in Oracle. Oracle can transform COUNT(column_name) into COUNT(*) when the counted column is not null, which is not possible in PostgreSQL. Therefore, when you use COUNT(column_name) in PostgreSQL, check whether the column is not null. If it is not null then do not use COUNT(column_name) but use COUNT(*).

Footnote
This note also conveys the message that proper attention to data modeling, particluarly with regard to the optionality of an attribute, is crucial for optimizing query performance. Without the not null constraint, Oracle and PostgreSQL would not be able to use COUNT(*).

Addendum
If you are an Oracle user, you may think that COUNT(column_name) is acceptable in Oracle since it can be automatically transformed to COUNT(*) when the column is not null. However, to optimize query performance and lower CPU usage, it is recommended to provide a perfect SQL statement to Oracle, as the transformation of a query can incur additional parsing load.

E-MAIL: root@postgresdba.com