Metadata-driven approach to storing historical data 2

Author : 모델광    Date : 2022-04-02 (Sat) 10:04    Views : 2173
--This article is the 2nd part of the previous article titled "Metadata-driven approach to storing historical data 1". You are kindly advised to read the previous one first.
In this note I will illustrate "the problems with the TO-BE data model".

* AS-IS model features

- Easy to understand

- There are many historical entity types in the database; we need one historical entity type for every base table.

- When an attribute is added to the base table, we have to add the attribute to the historical entity type as well, which means the total cost of managing the database increases.

- Ad hoc reporting against historical data is easy.

* TO-BE model features

- needs more storage because each row is stored in an "attribute_name : value" structure.

- needs only one historical entity type for many base tables.

- This model leans on the database's JSONB data type to store each historical row in a single column.

- It is incredibly flexible. When an attribute is added to the base table, we do nothing to the historical table.

- It is an advanced technique that can be difficult to implement at first.

- Ad hoc reporting against historical data is very difficult.

- Filtering based on a column's value becomes very inefficient, as the sketch after this list shows.
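
The last point deserves an illustration. Once a row is folded into a JSONB column, a filter on an ordinary column value can no longer use a plain B-tree index. The sketch below is my own illustration, using the TOTAL_HISTORY table that is defined later in this note and a made-up status value; it shows the kind of predicate involved and the specialized index it takes to mitigate the cost.

SELECT *
  FROM TOTAL_HISTORY
 WHERE TABLE_NAME = 'online_order'
   AND RECORD ->> 'ord_status_cd' = 'SHIPPED';   --has to read every history row of the table

--a GIN index can help, but only for containment-style predicates:
CREATE INDEX TOTAL_HISTORY_RECORD_GIN ON TOTAL_HISTORY USING GIN (RECORD jsonb_path_ops);
--  e.g. WHERE RECORD @> '{"ord_status_cd": "SHIPPED"}'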

One of the data architecture principles of this project is that we must not create UDFs, procedures, triggers, or packages in the database, so application developers will have to code everything for historical data themselves. The data modeler and I are pessimistic about the metadata-driven data model; we do not think the system will be able to handle the required throughput. But the data architect from the customer is adamant, so we are going ahead with it.
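
Because triggers are forbidden, every write path in the application would have to maintain the history table itself. The following is a rough sketch of my own of what that means for a single UPDATE; it assumes the TOTAL_HISTORY layout shown in the comment section, and the status value is made up.

BEGIN;

UPDATE online_order
   SET ord_status_cd = 'SHIPPED'
 WHERE ord_no = 555;

INSERT INTO TOTAL_HISTORY
       (table_oid, table_schema, table_name, table_pk, pk_val, op, record, old_record)
SELECT 'online_order'::regclass, 'public', 'online_order', 'ord_no', 555::text, 'UPDATE',
       to_jsonb(o), NULL   --the application also has to capture old_record by itself
  FROM online_order o
 WHERE o.ord_no = 555;

COMMIT;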

Today I decided to check the feasibility of the metadata-driven data model. Below are some test scripts I wrote, expecting to demonstrate that the TO-BE model is not acceptable.

To do the experiment efficiently I have created a trigger to store historical data.

The following is the table to store historical data of many base tables.

--I have moved the script to the comment section.


I have set up a trigger to fire when the data changes. We will fire the trigger once for every inserted/updated/deleted row.

--I have moved the script to the comment section.


The following is a function to register the trigger against the requested table.

--I have moved the script to the comment section.

The following example shows how the trigger works.

I have created a new table.


CREATE TABLE T1(

ID          INT             PRIMARY KEY,

NAME  VARCHAR(100)  NOT NULL);

I have enabled storing historical data on the table T1.


SELECT ENABLE_STORING('t1');

I have created a new record.

INSERT INTO T1(id, name) values (10,'TOM');


I have edited the row.

UPDATE T1 SET NAME = 'JAMES' WHERE ID = 10;


I have deleted the row.

DELETE FROM T1;

I have reviewed the table TOTAL_HISTORY.


SELECT * FROM TOTAL_HISTORY;

| ts | table_oid | table_schema | table_name | table_pk | pk_val | op | record | old_record |
| 2022-04-01 14:58:29.081756+09 | 25720 | public | t1 | id | 10 | INSERT | {"id": 10, "name": "TOM"} | |
| 2022-04-01 14:58:32.944023+09 | 25720 | public | t1 | id | 10 | UPDATE | {"id": 10, "name": "JAMES"} | {"id": 10, "name": "TOM"} |
| 2022-04-01 14:58:36.070793+09 | 25720 | public | t1 | id | 10 | DELETE | | {"id": 10, "name": "JAMES"} |

Notice that the table TOTAL_HISTORY stored the data successfully as we performed inserts, updates, and deletes.

Performance

The table TOTAL_HISTORY always reduces the throughput of inserts, updates, and deletes. In cases where throughput is less than 1,000 writes per second the overhead is negligible. For tables with a higher write frequency, we can expect poor performance.

Let us check how much performance will be degraded.

I have created a table which does not have the trigger to store historical data.


create table online_order_no_trigger as select * from online_order where 1=2;

ALTER TABLE online_order_no_trigger

ADD CONSTRAINT online_order_no_trigger_PK PRIMARY KEY(ord_no);

insert into online_order_no_trigger select * from online_order;

INSERT 0 1000000

Time: 2596.037 ms (00:02.596)

I have created a table which has the trigger.


create table online_order_trigger

as select * from online_order where 1=2;

ALTER TABLE online_order_trigger

ADD CONSTRAINT online_order_trigger_PK PRIMARY KEY(ord_no);

select enable_storing('public.online_order_trigger');

insert into online_order_trigger select * from online_order;

INSERT 0 1000000

Time: 52934.343 ms (00:52.934)

It should be noted that the elapsed time of the INSERT operation skyrocketed from 2.6 seconds to 52.9 seconds. Let's check the sizes of the tables.

select pg_relation_size('online_order')

, pg_relation_size('online_order_no_trigger')

, pg_relation_size('online_order_trigger')

, pg_relation_size('total_history') --this is the size of the table for historical data

 pg_relation_size | pg_relation_size | pg_relation_size | pg_relation_size
------------------+------------------+------------------+------------------
        167190528 |        167190528 |        167190528 |        372285440

We can see that the table TOTAL_HISTORY is much bigger because it stores data in a (column_name : column_value) structure.
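
If you want to check the per-row overhead yourself, a rough way (my own addition, not part of the original test) is to compare the size of a plain row with the size of its JSONB representation:

SELECT avg(pg_column_size(o))           AS plain_row_bytes
     , avg(pg_column_size(to_jsonb(o))) AS jsonb_row_bytes
  FROM online_order o;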

Now let us examine how we can retrieve data from the table TOTAL_HISTORY and how it affects performance.

I have created a metadata-driven table to store a column name and its position in the table.

--I have moved the script to the comment section.

INSERT INTO TA_META

    SELECT PA.ATTRELID, PA.ATTNUM, PA.ATTNAME, PA.ATTNUM, 'XZ'

      FROM PG_ATTRIBUTE PA, PG_CLASS PC

    WHERE PC.OID = PA.ATTRELID

         AND PC.RELNAME = 'online_order_trigger'

         AND PA.attnum > 0;

I assumed that there are 800 base tables with 100 columns for each table.

INSERT INTO TA_META

SELECT i::oid as table_oid, k::text as attr_id, k::text||'colval', k::int, 'ac'

   FROM generate_series(1,800) a(i), generate_series(1,100) b(k);

Let's suppose that we have to fetch the historical row of the table ONLINE_ORDER_TRIGGER as of 04/11/2022, retrieving the columns ord_no, ord_date, and ord_status_cd for the row whose ORD_NO value is 555.

Then we can write an SQL statement like this:


SELECT A.RECORD ->> B.COLS[1] AS ORD_NO        --this extracts the 'ord_no' value from the JSONB data
     , A.RECORD ->> B.COLS[2] AS ORD_DATE
     , A.RECORD ->> B.COLS[3] AS ORD_STATUS_CD
  FROM TOTAL_HISTORY A
     , (SELECT TABLE_OID
             , ARRAY_AGG(ATTR_NAME ORDER BY ATTR_POSITION) AS COLS  --ORDER BY keeps the array order deterministic
          FROM TA_META
         WHERE TABLE_OID IN (25705)   --25705 is the oid of the table ONLINE_ORDER_TRIGGER
           AND ATTR_NAME IN ('ord_no', 'ord_date', 'ord_status_cd')
         GROUP BY TABLE_OID
       ) B
 WHERE A.TABLE_OID = B.TABLE_OID
   AND A.PK_VAL = 555::text
   AND A.TS < '2022-04-11'::TIMESTAMP + INTERVAL '1 DAY'
 ORDER BY A.TS DESC
 FETCH NEXT 1 ROWS ONLY;

I am not very familiar with extracting data from a JSONB column, so the query above may not be optimal.
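
For what it is worth, when the wanted column names are already known at the time the query is written, the TA_META join can be skipped entirely; the form below is an alternative I would also consider (an assumption on my part, not the project's agreed pattern). The execution plan analysed next is for the TA_META-driven query above, not for this one.

SELECT A.RECORD ->> 'ord_no'        AS ORD_NO
     , A.RECORD ->> 'ord_date'      AS ORD_DATE
     , A.RECORD ->> 'ord_status_cd' AS ORD_STATUS_CD
  FROM TOTAL_HISTORY A
 WHERE A.TABLE_OID = 'online_order_trigger'::regclass
   AND A.PK_VAL    = 555::text
   AND A.TS < '2022-04-11'::TIMESTAMP + INTERVAL '1 DAY'
 ORDER BY A.TS DESC
 FETCH NEXT 1 ROWS ONLY;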

Here is the execution plan produced by the optimizer.


Limit (actual time=0.066..0.067 rows=1 loops=1)
  Buffers: shared hit=5 read=3
  ->  Sort (actual time=0.065..0.065 rows=1 loops=1)
        Sort Key: a.ts DESC
        Sort Method: quicksort  Memory: 25kB
        Buffers: shared hit=5 read=3
        ->  Nested Loop (actual time=0.057..0.060 rows=2 loops=1)
              Buffers: shared hit=5 read=3
              ->  GroupAggregate (actual time=0.016..0.016 rows=1 loops=1)
                    Group Key: ta_meta.table_oid
                    Buffers: shared hit=3
                    ->  Bitmap Heap Scan on ta_meta (actual time=0.010..0.012 rows=3 loops=1)
                          Recheck Cond: (table_oid = '25705'::oid)
                          Filter: ((attr_name)::text = ANY ('{ord_no,ord_date,ord_status_cd}'::text[]))
                          Rows Removed by Filter: 3
                          Heap Blocks: exact=1
                          Buffers: shared hit=3
                          ->  Bitmap Index Scan on ta_meta_pk (actual time=0.006..0.006 rows=6 loops=1)
                                Index Cond: (table_oid = '25705'::oid)
                                Buffers: shared hit=2
              ->  Index Scan using total_history_pk on total_history a (actual time=0.019..0.021 rows=2 loops=1)
                    Index Cond: ((table_oid = ta_meta.table_oid) AND ((pk_val)::text = '555'::text) AND (ts < '2022-04-12 00:00:00'::timestamp without time zone))
                    Buffers: shared hit=2 read=3
Planning:
  Buffers: shared hit=21 read=1
Planning Time: 0.455 ms
Execution Time: 0.131 ms


While accessing the total_history_pk index, the query visited only 5 blocks (2 blocks in memory and 3 blocks read from disk), which is very efficient. I was astonished by this experiment. As long as you include a primary key value in the predicates of your query, retrieving a historical data set is very fast; it can meet the UX-defined time limits. I ran a lot of transactions against the table ONLINE_ORDER_TRIGGER, which loaded many rows into the table TOTAL_HISTORY, but when I issued the same query, the number of Index Scan block I/Os did not rise. Keep in mind that the column order of the index TOTAL_HISTORY_PK is of critical importance: if you put the column TS first, the number of block I/Os will skyrocket, as the sketch below illustrates.
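
To make the column-order point concrete, here is a sketch of my own (not part of the original test). The primary key used in this test puts the two equality columns first and the range column last; a TS-leading index would have to walk every entry older than the cutoff date, for every table and every key value.

--the column order used in this test: equality columns first, the range column (TS) last
--  PRIMARY KEY (TABLE_OID, PK_VAL, TS)
--a TS-leading alternative makes the same query walk far more index blocks:
CREATE INDEX TOTAL_HISTORY_TS_FIRST_X ON TOTAL_HISTORY (TS, TABLE_OID, PK_VAL);  --for comparison only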

From this experiment, I came to the conclusion that the metadata-driven approach to handling historical data can meet the UX-defined time limits as long as we use the primary key value in the WHERE clause.


Conclusion

If you always query a table using primary key values in the WHERE clause, the metadata-driven approach to storing historical data is acceptable in terms of performance.


The downside of the approach is as follows:

1. Even though we can reduce the number of tables for historical data dramatically, we need more disk space in the long run.

2. It requires a sophisticated index strategy and advanced-level query writers.

3. If you do not use a primary key value in the WHERE clause, you will see an hourglass.


The advantage of the approach is as follows:

1. It is very flexible: you don't have to do anything to the table for historical data when a column is added to or dropped from the base table.

Footnote

About 10 years ago I took a data modeling course offered by ENCORE. The data model features presented at the start of this note came from that course. The instructor's name was Jang HeeSik. I have rarely used the metadata-driven model in my career.

Addendum

The TO-BE model is based on the business requirement that the primary key value is never modified.

Added on April 9

I am afraid the conclusion I drew was not reasonable; it was hasty. I did not take insert operations into account. If there are many inserts in the base tables, the database system will not be able to handle the contention in the table TOTAL_HISTORY. Surely, tests with real application transactions will tell whether the TO-BE model is feasible. I have a hunch that the metadata-driven model cannot cope with the contention incurred by insert operations.
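
One way to put a number on that worry would be a simple concurrency test. The sketch below is my own idea of such a test, not something I have run for this note: it drives concurrent single-row inserts against the small table T1 used earlier (which still has the trigger enabled) with pgbench, so the tps can be compared with and without the trigger.

--one-time setup: CREATE SEQUENCE t1_id_seq;
--save the INSERT below as insert_t1.sql and run, for example:
--  pgbench -n -c 32 -j 8 -T 60 -f insert_t1.sql
INSERT INTO t1(id, name) VALUES (nextval('t1_id_seq'), 'pgbench');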



모델광 2022-04-02 (Sat) 10:13
The following is the table to store historical data of many base tables.

CREATE TABLE TOTAL_HISTORY(
TS            TIMESTAMPTZ    NOT NULL DEFAULT NOW(),
TABLE_OID     OID            NOT NULL,  --pg internal id for a table
PK_VAL        VARCHAR(2000)  NOT NULL,  --key value
TABLE_PK      VARCHAR(200)   NOT NULL,  --table's primary key columns
TABLE_SCHEMA  VARCHAR(20)    NOT NULL,  --table's schema name e.g. 'scott'
TABLE_NAME    VARCHAR(100)   NOT NULL,  --table's table name e.g. 'emp'
OP            VARCHAR(6)     NOT NULL,  --INSERT/UPDATE/DELETE
RECORD        JSONB,                    --contents of the new row
OLD_RECORD    JSONB                     --previous row contents (for UPDATE/DELETE)
);
CREATE INDEX TOTAL_HISTORY_TS_X01     ON TOTAL_HISTORY USING BRIN(TS);
CREATE INDEX TOTAL_HISTORY_TABLE_OID  ON TOTAL_HISTORY (TABLE_OID);
ALTER TABLE TOTAL_HISTORY DROP CONSTRAINT IF EXISTS TOTAL_HISTORY_PK;
ALTER TABLE TOTAL_HISTORY ADD CONSTRAINT TOTAL_HISTORY_PK
                                            PRIMARY KEY(TABLE_OID, PK_VAL, TS);

I have set up a trigger to fire when the data changes. We will fire the trigger once for every inserted/updated/deleted row.

CREATE OR REPLACE FUNCTION INSERT_UPDATE_DELETE_TRIGGER()
RETURNS TRIGGER
LANGUAGE PLPGSQL
AS $$
DECLARE
  PKEY_COLS              TEXT;
  RECORD_JSONB        JSONB = TO_JSONB(NEW);
  PK_VALUE                TEXT;
  OLD_RECORD_JSONB JSONB = TO_JSONB(OLD);
BEGIN
  SELECT PA.ATTNAME::TEXT
    INTO PKEY_COLS
    FROM PG_INDEX PI, PG_ATTRIBUTE PA
  WHERE PI.INDRELID = PA.ATTRELID
      AND PA.ATTNUM = ANY(PI.INDKEY)
      AND PI.INDRELID = TG_RELID
      AND INDISPRIMARY;
  PK_VALUE = COALESCE(TO_JSONB(NEW),TO_JSONB(OLD)) ->> PKEY_COLS;
  insert into TOTAL_HISTORY (
      table_oid, table_schema, table_name, table_pk, pk_val, op,record, old_record )
  select TG_RELID,
          TG_TABLE_SCHEMA,
          TG_TABLE_NAME,
          PKEY_COLS,
          PK_VALUE,
          TG_OP,
          record_jsonb,
          old_record_jsonb;
  RETURN COALESCE(NEW, OLD);
END; $$;

The following is a function to register the trigger against the requested table.

CREATE OR REPLACE FUNCTION ENABLE_STORING(REGCLASS)
RETURNS VOID
VOLATILE LANGUAGE PLPGSQL
AS $$
DECLARE
  --%s instead of %I: the regclass argument already renders as a properly
  --quoted (and, if necessary, schema-qualified) name
  V_SQL TEXT = FORMAT('CREATE TRIGGER AUDIT_I_U_D
                                        BEFORE INSERT OR UPDATE OR DELETE
                                        ON %s
                                      FOR EACH ROW
                              EXECUTE PROCEDURE INSERT_UPDATE_DELETE_TRIGGER();',$1);
BEGIN
  --the history trigger relies on a primary key, so refuse tables without one
  IF NOT EXISTS (SELECT 1 FROM PG_INDEX
                          WHERE INDRELID = $1 AND INDISPRIMARY) THEN
      RAISE EXCEPTION 'Table % can not be stored because it has no primary key', $1;
  END IF;
  IF NOT EXISTS (SELECT 1 FROM PG_TRIGGER
                          WHERE TGRELID = $1 AND TGNAME = 'audit_i_u_d')
                THEN EXECUTE V_SQL;
  END IF;
END;
$$;

I have created a metadata-driven table to store a column name and its position in the table.

CREATE TABLE TA_META(
  TABLE_OID      OID          NOT NULL, --pg internal id for a table
  ATTR_ID        VARCHAR(10)  NOT NULL,
  ATTR_NAME      VARCHAR(50)  NOT NULL, --column name
  ATTR_POSITION  INT          NOT NULL, --the position of the column in the table
  DUMMY          CHAR(50)
);

ALTER TABLE TA_META ADD CONSTRAINT TA_META_PK PRIMARY KEY (TABLE_OID, ATTR_ID);
   
