설문조사
PostgreSQL/PPAS 관련 듣고 싶은 교육은


총 게시물 167건, 최근 0 건
   

Attribute Ordering

글쓴이 : 모델광 날짜 : 2024-05-18 (토) 07:27 조회 : 341
In data modeling, there are no strict rules that dictate the order of attributes within an entity. However, when I am tasked with modeling data, I typically follow somewhat flexible guidelines for attribute sequencing. Here are some of them:

1) The identifier, which represents the primary key in the physical model, must be displayed first.
2) If certain attributes are expected to be frequently used in queries, you should position them near the top for easier access.
3) Attributes that hold significant importance in the business context may be placed near the top. For example, in a SALES entity, you might position the sales amount attribute prominently.
4) Organize attributes logically by grouping related or similar attributes together. For example, in a CUSTOMER entity, you can group attributes related to contact information (e.g., name, address, phone number).
5) Attributes that are expected to contain many null values should be placed towards the bottom.
6) The audit attributes, including CREATE_USER_NAME, CREATE_DATE, MODIFY_USER_NAME, and MODIFY_DATE, should be displayed last and consistently maintain the same order in all entities.

I learned most of these guidelines from various data modeling courses, particularly those instructed by professionals with a strong background in Oracle. After working with PostgreSQL for about five years, I recently realized that it is essential to consider the data type of the attribute when determining the attribute order. In Oracle, the placement of the attribute's data type has no impact on the size of the table. However, in PostgreSQL, the order in which you arrange the columns has a significant effect on the table's size.

In this note, I will explain why you should consider attribute sequencing in PostgreSQL to minimize the table size. PostgreSQL is designed such that its own internal natural alignment is 8 bytes, meaning consecutive fixed-length columns of differing sizes must be padded with empty bytes in some cases. We can observe this with the following example:

SELECT pg_column_size(row())            AS empty,
       pg_column_size(row(0::SMALLINT)) AS byte2,
       pg_column_size(row(0::INT))      AS byte4,
       pg_column_size(row(0::BIGINT))   AS byte8,
       pg_column_size(row(0::SMALLINT, 0::BIGINT)) AS byte16,
       pg_column_size(row(0::BIGINT, 0::SMALLINT)) AS byte10,
       PG_COLUMN_SIZE(ROW(129::int, '20220101'::timestamp)) AS bytes_16,
       PG_COLUMN_SIZE(ROW('20220101'::timestamp, 129::int)) AS bytes_12;

empty|byte2|byte4|byte8|byte16|byte10|bytes_16|bytes_12|
-----+-----+-----+-----+------+------+--------+--------+
   24|   26|   28|   32|    40|    34|      40|      36|

The result suggests that an empty PostgreSQL row requires 24 bytes of header elements. Let's compare (0::SMALLINT, 0::BIGINT) with (0::BIGINT, 0::SMALLINT). The first arrangement of columns occupies 16 bytes, while the second alignment of columns takes up 10 bytes. In the first arrangement of columns, I have a SMALLINT which is represented as two bytes in PostgreSQL, then I have a BIGINT which is represented as eight bytes. That 64-bit integer cannot actually fit right next to the SMALLINT. The BIGINT has to align at eight bytes, and in this case, it is at the second byte, so it has to skip the next 6 bytes, and then PostgreSQL can put the BIGINT afterwards. This tells us that if you need to place SMALLINT and BIGINT data types consecutively and you want to minimize the row size, you should position the BIGINT data type first. In a similar vein, if you need to arrange INT and TIMESTAMP consecutively, you should place TIMESTAMP at the beginning.
In conclusion, if your table contains consecutive fixed-length columns, you should place the longer column before the shorter column to reduce the size of the table.

Now, let's examine the NUMERIC and TEXT columns. They are variable-length types.

SELECT pg_column_size(row()) AS empty,
       pg_column_size(row(1::NUMERIC)) AS byte5,
       pg_column_size(row(10.2::NUMERIC)) AS byte7,
       pg_column_size(row(1::SMALLINT, 1::NUMERIC)) AS byte_2_5,
       pg_column_size(row(1::NUMERIC, 1::SMALLINT)) AS round_6_2,
       pg_column_size(row(1::INT4, 1::NUMERIC)) AS byte_4_5,
       pg_column_size(row(1::NUMERIC, 1::INT4)) AS round_8_4;

empty|byte5|byte7|byte_2_5|round_6_2|byte_4_5|round_8_4|
-----+-----+-----+--------+---------+--------+---------+
   24|   29|   31|      31|       32|      33|       36|

Note that (1::SMALLINT, 1::NUMERIC) occupies 7 bytes (2+5), while (1::NUMERIC, 1::SMALLINT) consumes 8 bytes (6+2). In the latter arrangement, 1::NUMERIC was rounded up to 6 bytes due to the data type of the following value. Similarly, in the arrangement of (1::NUMERIC, 1::INT4), 1::NUMERIC was rounded up to 8 bytes. From this, we can see that variable-length types round to the nearest 2 bytes, 4 bytes, or 8 bytes based on the type of the next column.
In conclusion, if your table contains a fixed-length column and a variable-length column consecutively, you should position the fixed-length column before the variable-length column to reduce the table's size.
(This means we can chain variable-length columns all day long without introducing padding except at the right boundary. Consequently, we can deduce that variable-length columns introduce no bloat as long as they are at the end of a column string.)

Let's examine how much space we can save if we place the fixed-length column before the variable-length column in a table. Here is an entity designed based on the modeling guideline aforementioned.
I will start with a script that simply creates a table based on the ERD and populates it with 0.2 million rows.

CREATE TABLE CUSTOMER2
(
    CUST_NO         integer NOT NULL,
    CUST_NM         character varying(10) NOT NULL,
    CUST_STS_CD     character varying(1) NOT NULL,
    BIRTH_DT        date NOT NULL,
    REG_CUST_AGE    smallint NOT NULL,
    POSTAL_CD       character varying(6) NOT NULL,
    BSIC_ADDR       character varying(20) NOT NULL,
    DTL_ADDR        character varying(50) NOT NULL,
    MBLE_PHNE_NO    character varying(11) NOT NULL,
    REG_PATH_CD     character varying(1) NOT NULL,
    REG_DT          date NOT NULL,
    CUST_GRD_CD     character varying(1) NOT NULL,
    TOTL_MLGE_SCRE  integer NOT NULL,
    CUST_AUTH_YN    boolean NOT NULL,
    CUST_AUTH_TIME  timestamp,
    INF_PRNT_AGR_YN boolean NOT NULL,
    INF_PRNT_AGR_TIME timestamp,
    INF_PRNT_THR_AGR_YN boolean NOT NULL,
    INF_PRNT_THJR_AGR_TIME timestamp,
    DOUT_DT         date,
    RGTR_ID         character varying(8) NOT NULL,
    REG_TIME        timestamp without time zone NOT NULL,
    MDFR_ID         character varying(8) NOT NULL,
    MOD_TIME        timestamp without time zone NOT NULL
);

INSERT INTO CUSTOMER2
SELECT i AS CUST_NO,
       upper(substr(md5(random()::text), 1, 3)) AS cust_nm,
       (mod(i, 3) + 1)::text AS cust_sts_cd,
       current_date - 30 - mod(i, 3650) AS birth_dt,
       mod(i, 100) + 1 AS reg_cust_age,
       lpad((mod(i, 90000) + 1)::text, 6, '0'::text) AS postal_cd,
       substr(md5(random()::text), 1, 10) AS bsic_addr,
       substr(md5(random()::text), 1, 10) AS dtl_addr,
       lpad(i::text, 11, '7'::text) AS mble_phne_no,
       (mod(i, 2) + 1)::text AS reg_path_cd,
       current_date - mod(i, 4000) AS reg_dt,
       (mod(i, 5) + 1)::text AS cust_grd_cd,
       mod(i, 2000) * 10 AS totl_mlge_scre,
       mod(i, 2)::boolean AS cust_auth_yn,
       CASE WHEN mod(i, 2) = 0 THEN null ELSE current_date - mod(i, 4000) END AS cust_auth_dt,
       mod(i, 2)::boolean AS inf_prnt_agr_yn,
       CASE WHEN mod(i, 2) = 0 THEN null ELSE current_date - mod(i, 4000) END AS inf_prnt_agr_dt,
       mod(i, 2)::boolean AS inf_prnt_thr_agr_yn,
       CASE WHEN mod(i, 2) = 0 THEN null ELSE current_date - mod(i, 4000) END AS inf_prnt_thr_agr_dt,
       CASE WHEN mod(i, 20) = 0 THEN current_date - mod(i, 100) ELSE null END AS dout_dt,
       'abcdefgh' AS rgtr_id,
       current_timestamp - interval '2000 hour' AS reg_time,
       'abcdefgh' AS mdfr_id,
       current_timestamp - interval '1 day' AS mod_time
FROM generate_series(1, 200000) a(i);

I have run the following query to compute the disk space used by the table:

SELECT pg_table_size('customer2');

pg_table_size|
-------------+
     36454400|

We can rearrange the order of the columns to minimize the table size as follows:

CREATE TABLE CUSTOMER_OPTIMAL AS
SELECT cust_no,
       birth_dt,
       reg_time,
       mod_time,
       totl_mlge_scre,
       reg_dt,
       cust_auth_time,
       inf_prnt_agr_time,
       inf_prnt_thjr_agr_time,
       dout_dt,
       reg_cust_age,
       cust_auth_yn,
       inf_prnt_thr_agr_yn,
       inf_prnt_agr_yn,
       cust_sts_cd,
       postal_cd,
       bsic_addr,
       dtl_addr,
       mble_phne_no,
       mdfr_id,
       reg_path_cd,
       cust_grd_cd,
       cust_nm,
       rgtr_id
FROM CUSTOMER2;

I have run the following query to compute the disk space used by the optimized table:

SELECT pg_table_size('customer_optimal');

pg_table_size|
-------------+
     32800768|


As expected, the table's size has decreased from 36 MB to 32 MB, resulting in around an 11.1% reduction in terms of disk usage.

We may be able to use the reorganized table in the production system. However, we must consider the readability and maintainability of the model when we design a table. This table with the columns reordered is hard to read. Some audit columns, such as REG_TIME and MOD_TIME, are not positioned at the bottom of the table. This requires a developer to scan through all the columns to locate the audit columns. Additionally, it is evident that a customer's name is important data, but the CUST_NM column is positioned at the end of the table due to its variable-length data type.

Therefore, we need to be practical when determining the column order in a table. Readability and maintainability are as important as the table's size. In a project I am working on, I added two items to the attribute sequencing guideline as follows:

1) The audit attributes should be placed at the bottom in the following order:
CREATE DATE, MODIFY DATE, CREATE USER NAME, MODIFY USER NAME
2) When the table's size is of critical consequence, attributes with fixed-length data types should be placed before attributes with variable-length data types.

Let's say that the size of the CUSTOMER table is not critical, so we can redesign the table as follows:

CREATE TABLE CUSTOMER_PRAG AS
SELECT CUST_NO,
       CUST_NM,
       CUST_STS_CD,
       BIRTH_DT,
       REG_CUST_AGE,
       POSTAL_CD,
       BSIC_ADDR,
       DTL_ADDR,
       MBLE_PHNE_NO,
       REG_PATH_CD,
       REG_DT,
       CUST_GRD_CD,
       TOTL_MLGE_SCRE,
       CUST_AUTH_YN,
       CUST_AUTH_TIME,
       INF_PRNT_AGR_YN,
       INF_PRNT_AGR_TIME,
       INF_PRNT_THR_AGR_YN,
       INF_PRNT_THJR_AGR_TIME,
       DOUT_DT,
       REG_TIME,
       MOD_TIME,
       RGTR_ID,
       MDFR_ID
FROM CUSTOMER2;


Taking advantage of the pg_table_size function, we can determine the size of the table:

SELECT pg_table_size('customer_prag');

pg_table_size|
-------------+
     35659776|

Note that the table's size decreased from 36 MB to 35 MB, which is around a 2.2% improvement in size. This time, we did not sacrifice the readability and maintainability of the data model because most columns are in the same order as the originally designed CUSTOMER table.
Reordering attributes is a no-brainer. Here is my go-to query when I need to rearrange the attributes to minimize the table size:

SELECT a.attname, t.typname, t.typalign, t.typlen
  FROM pg_class c
  JOIN pg_attribute a ON (a.attrelid = c.oid)
  JOIN pg_type t ON (t.oid = a.atttypid)
WHERE c.relname = 'customer'   -- we put the table name here we are trying to optimize for size.
  AND a.attnum >= 0
ORDER BY t.typlen DESC;


The above query implements a simple algorithm: "Sort the columns by their type length as defined in pg_type."
All things being equal, we can mix and match columns with matching type lengths, and if we are brave or desire a prettier column ordering, we can combine types of shorter lengths where necessary.
If we go with this table design, we can save some of the total space.

Conclusion
In PostgreSQL, the order in which you put columns actually makes a difference in terms of performance. By minimizing data size, we can ensure that smaller data fits into memory, resulting in faster applications. We should place fixed-length columns before variable-length columns. Eight-byte columns first, four-byte columns next, two-byte or one-byte columns last, and variable-length data types like NUMERIC and TEXT tacked to the end. If we have to maintain the readability and maintainability of the data model, we can simply reorder some audit attributes at the tail, placing variable-length columns at the bottom.


   

postgresdba.com