--This article is the second part of "Metadata-driven approach to storing historical data 1". You are kindly advised to read the previous article first.
In this note I will illustrate "the problems with the TO-BE data model".
* AS-IS model features
- Easy to understand
- There are many historical entity types in the database. We have to have one historical entity per base table.
- When an attribute is added to the base table, we have to add the attribute to the historical entity type as well, which means the total cost of managing the database increases.
- Ad hoc reporting against historical data is easy.
* TO-BE model features
- needs more storage because a row is inserted in the "attribute_name : value" structure.
- needs only one historical entity type for many base tables.
- This model leans on the database's JSONB data type to store each historical row in a single column.
- It is incredibly flexible. When an attribute is added to the base table, we do nothing to the historical table.
- very advanced technique that can be difficult to implement at first.
- Ad hoc reporting against historical data is very difficult.
- Filtering based on a column's value becomes very inefficient (see the sketch below).
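To make the last point concrete, here is a minimal sketch of such a filter (the table and column names here are placeholders I made up for illustration): every per-column predicate has to be evaluated against the JSONB document, so an ordinary per-column B-tree index cannot help.

-- Hypothetical example: find historical rows of some base table by an ordinary column value.
-- Without a GIN or expression index on the JSONB column, this is a full scan of the history table.
SELECT *
FROM some_history_table
WHERE table_name = 'some_base_table'
AND record ->> 'some_column' = 'some_value';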
One of the data architecture principles of this project is that you must not create a UDF, a procedure, a trigger, or a package in the database.
So application developers will have to code everything for historical data. The data modeler and I are pessimistic about the metadata-driven data model. We think that the system will not be able to handle its throughput. But the data architect from the customer is very stubborn, so we are just going along with it.
Today I decided to check the feasibility of the metadata-driven data model. Below are some test scripts to demonstrate that the TO-BE model is not acceptable.
To do the experiment efficiently I have created a trigger to store historical data.
The following is the table to store historical data of many base tables.
--I have moved the script to the comment section.
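For readers who do not want to jump to the comments, here is a rough sketch of what such a table can look like, reconstructed from the columns that appear in the query results below. The data types are my assumptions, and the exact script in the comment section may differ.

-- Rough sketch only: column list inferred from the results shown later in this note,
-- data types are assumptions, and the exact DDL is in the comment section.
CREATE TABLE total_history (
    ts           timestamptz NOT NULL DEFAULT clock_timestamp(), -- when the change happened
    table_oid    oid         NOT NULL,  -- oid of the base table
    table_schema name        NOT NULL,
    table_name   name        NOT NULL,
    table_pk     name        NOT NULL,  -- name of the base table's primary key column
    pk_val       text        NOT NULL,  -- primary key value of the changed row, stored as text
    op           text        NOT NULL,  -- 'INSERT', 'UPDATE' or 'DELETE'
    record       jsonb,                 -- the row after the change (NULL for DELETE)
    old_record   jsonb                  -- the row before the change (NULL for INSERT)
);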
I have set up a trigger to fire when the data changes. We will fire the trigger once for every inserted/updated/deleted row.
--I have moved the script to the comment section.
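In essence, the moved script is a row-level trigger function that serializes NEW and OLD into JSONB and inserts one row into TOTAL_HISTORY per changed row. A simplified sketch is shown below; the function name fn_store_history is chosen here for illustration, it assumes a single-column primary key, and the real script in the comment section may differ.

-- Simplified sketch of the row-level history trigger function (illustrative name fn_store_history).
CREATE OR REPLACE FUNCTION fn_store_history() RETURNS trigger AS $$
DECLARE
    v_pk_col name;
    v_pk_val text;
BEGIN
    -- Find the primary key column of the table that fired the trigger
    -- (sketch assumes a single-column primary key).
    SELECT a.attname
      INTO v_pk_col
      FROM pg_index i
      JOIN pg_attribute a
        ON a.attrelid = i.indrelid
       AND a.attnum   = ANY (i.indkey)
     WHERE i.indrelid = TG_RELID
       AND i.indisprimary;

    IF TG_OP = 'DELETE' THEN
        v_pk_val := to_jsonb(OLD) ->> v_pk_col;
    ELSE
        v_pk_val := to_jsonb(NEW) ->> v_pk_col;
    END IF;

    INSERT INTO total_history
        (ts, table_oid, table_schema, table_name, table_pk, pk_val, op, record, old_record)
    VALUES
        (clock_timestamp(), TG_RELID, TG_TABLE_SCHEMA, TG_TABLE_NAME, v_pk_col, v_pk_val, TG_OP,
         CASE WHEN TG_OP IN ('INSERT', 'UPDATE') THEN to_jsonb(NEW) END,  -- row after the change
         CASE WHEN TG_OP IN ('UPDATE', 'DELETE') THEN to_jsonb(OLD) END); -- row before the change

    RETURN NULL;  -- AFTER row trigger: the return value is ignored
END;
$$ LANGUAGE plpgsql;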
The following is a function to register the trigger against the requested table.
--I have moved the script to the comment section.
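Again, only a simplified sketch (the real script is in the comment section): the registration function just attaches the trigger function sketched above to the requested table.

-- Simplified sketch: register the history trigger on the table given by name.
-- PostgreSQL 11+ syntax; use EXECUTE PROCEDURE on older versions.
CREATE OR REPLACE FUNCTION enable_storing(p_table regclass) RETURNS void AS $$
BEGIN
    EXECUTE format(
        'CREATE TRIGGER tr_store_history
             AFTER INSERT OR UPDATE OR DELETE ON %s
             FOR EACH ROW EXECUTE FUNCTION fn_store_history()',
        p_table);
END;
$$ LANGUAGE plpgsql;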
The following example shows how the trigger works.
I have created a new table.
CREATE TABLE T1(
ID INT PRIMARY KEY,
NAME VARCHAR(100) NOT NULL);
I have enabled storing historical data on the table T1.
SELECT ENABLE_STORING('t1');
I have created a new record.
INSERT INTO T1(id, name) values (10,'TOM');
I have edited the row.
UPDATE T1 SET NAME = 'JAMES' WHERE ID = 10;
I have deleted the row.
DELETE FROM T1;
I have reviewed the table TOTAL_HISTORY:
SELECT * FROM TOTAL_HISTORY;
| ts | table_oid | table_schema | table_name | table_pk | pk_val | op | record | old_record |
| 2022-04-01 14:58:29.081756+09 | 25720 | public | t1 | id | 10 | INSERT | {"id": 10, "name": "TOM"} | |
| 2022-04-01 14:58:32.944023+09 | 25720 | public | t1 | id | 10 | UPDATE | {"id": 10, "name": "JAMES"} | {"id": 10, "name": "TOM"} |
| 2022-04-01 14:58:36.070793+09 | 25720 | public | t1 | id | 10 | DELETE | | {"id": 10, "name": "JAMES"} |
Notice that the table TOTAL_HISTORY stored the data successfully as we performed inserts, updates, and deletes.
Performance
The table TOTAL_HISTORY always reduces the throughput of inserts, updates, and deletes. In cases where throughput is less than 1000 writes per second, the overhead is negligible. For tables with a higher write frequency, we can expect nasty performance degradation.
Let us check how much performance will be degraded.
I have created a table which does not have the trigger to store historical data.
create table online_order_no_trigger as select * from online_order where 1=2;
ALTER TABLE online_order_no_trigger
ADD CONSTRAINT online_order_no_trigger_PK PRIMARY KEY(ord_no);
insert into online_order_no_trigger select * from online_order;
INSERT 0 1000000
Time: 2596.037 ms (00:02.596)
I have created a table which has the trigger.
create table online_order_trigger
as select * from online_order where 1=2;
ALTER TABLE online_order_trigger
ADD CONSTRAINT online_order_trigger_PK PRIMARY KEY(ord_no);
select enable_storing('public.online_order_trigger');
insert into online_order_trigger select * from online_order;
INSERT 0 1000000
Time: 52934.343 ms (00:52.934)
It should be noted that the elapsed time of the INSERT operation has skyrocketed from about 2.6 seconds to 52.9 seconds. Let's check the size of the tables.
select pg_relation_size('online_order')
, pg_relation_size('online_order_no_trigger')
, pg_relation_size('online_order_trigger')
, pg_relation_size('total_history'); --this is the size of the table for historical data
pg_relation_size | pg_relation_size | pg_relation_size | pg_relation_size
------------------+------------------+------------------+------------------
167190528 | 167190528 | 167190528 | 372285440
We can see that the table TOTAL_HISTORY is much bigger because it stores data with (column_name : column_value) structure.
Now let us examine how we can retrieve data from the table TOTAL_HISTORY and how it affects performance.
I have created a metadata-driven table to store a column name and its position in the table.
--I have moved the script to the comment section.
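As a rough sketch (the real script is in the comment section), the table holds one row per (base table, column). The column names below follow the INSERT statements that come next; the data types, the primary key definition, and the meaning of the last column are my guesses.

-- Rough sketch only: types and the primary key are assumptions, and the purpose of the
-- last column ('XZ' / 'ac' in the INSERTs below) is not explained in this note.
CREATE TABLE ta_meta (
    table_oid  oid  NOT NULL,  -- oid of the base table
    attr_id    text NOT NULL,  -- attribute identifier (attnum of the column)
    attr_name  text NOT NULL,  -- column name, i.e. the key inside the JSONB record
    attr_order int  NOT NULL,  -- ordinal position of the column in the base table
    attr_etc   text,           -- placeholder name; meaning unclear from this note
    CONSTRAINT ta_meta_pk PRIMARY KEY (table_oid, attr_id)
);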
INSERT INTO TA_META
SELECT PA.ATTRELID, PA.ATTNUM, PA.ATTNAME, PA.ATTNUM, 'XZ'
FROM PG_ATTRIBUTE PA, PG_CLASS PC
WHERE PC.OID = PA.ATTRELID
AND PC.RELNAME = 'online_order_trigger'
AND PA.attnum > 0;
I assumed that there are 800 base tables with 100 columns for each table.
INSERT INTO TA_META
SELECT i::oid as table_oid, k::text as attr_id, k::text||'colval', k::int, 'ac'
FROM generate_series(1,800) a(i), generate_series(1,100) b(k);
Let's suppose that we have to fetch a row with the columns ord_no, ord_date, and ord_status_cd on 04/11/2022 from the historical data of the table ONLINE_ORDER_TRIGGER. The value of the column ORD_NO is 555.
Then we can write an SQL statement like this:
SELECT A.RECORD ->> B.cols[1] AS ORD_NO --this extracts the 'ord_no' value from JSONB data
,A.RECORD ->> B.COLS[2] AS ORD_DATE
,A.RECORD ->> B.COLS[3] AS ORD_STATUS_CD
FROM TOTAL_HISTORY A
, (SELECT TABLE_OID, array_agg(ATTR_NAME) as cols
FROM TA_META
WHERE TABLE_OID IN (25705) --25705 is the oid of the table ONLINE_ORDER_TRIGGER
AND ATTR_NAME IN ('ord_no', 'ord_date','ord_status_cd')
GROUP BY TABLE_OID
) B
WHERE A.TABLE_OID = B.TABLE_OID
AND A.pk_val = 555::text
AND A.TS < '2022-04-11'::TIMESTAMP + INTERVAL '1 DAY'
ORDER BY A.TS DESC
FETCH NEXT 1 ROWS ONLY;
I am not so familiar with extracting data from a JSONB column, so the above query might not be optimal.
Here is the execution plan produced by the optimizer.
Limit (actual time=0.066..0.067 rows=1 loops=1)
Buffers: shared hit=5 read=3
-> Sort (actual time=0.065..0.065 rows=1 loops=1)
Sort Key: a.ts DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=5 read=3
-> Nested Loop (actual time=0.057..0.060 rows=2 loops=1)
Buffers: shared hit=5 read=3
-> GroupAggregate (actual time=0.016..0.016 rows=1 loops=1)
Group Key: ta_meta.table_oid
Buffers: shared hit=3
-> Bitmap Heap Scan on ta_meta (actual time=0.010..0.012 rows=3 loops=1)
Recheck Cond: (table_oid = '25705'::oid)
Filter: ((attr_name)::text = ANY ('{ord_no,ord_date,ord_status_cd}'::text[]))
Rows Removed by Filter: 3
Heap Blocks: exact=1
Buffers: shared hit=3
-> Bitmap Index Scan on ta_meta_pk (actual time=0.006..0.006 rows=6 loops=1)
Index Cond: (table_oid = '25705'::oid)
Buffers: shared hit=2
-> Index Scan using total_history_pk on total_history a (actual time=0.019..0.021 rows=2 loops=1)
Index Cond: ((table_oid = ta_meta.table_oid) AND ((pk_val)::text = '555'::text) AND (ts < '2022-04-12 00:00:00'::timestamp without time zone))
Buffers: shared hit=2 read=3
Planning:
Buffers: shared hit=21 read=1
Planning Time: 0.455 ms
Execution Time: 0.131 ms
While accessing the total_history_pk index, the query visited only 5 blocks (2 blocks in memory and 3 blocks on disk), which is very efficient. I was astonished by this experiment. As long as you include a primary key value in the predicates of your query, getting a historical data set is very fast. It can meet the UX-defined time limits. I ran a lot of transactions against the table ONLINE_ORDER_TRIGGER, which caused many rows to be loaded into the table TOTAL_HISTORY, but when I issued the same query, the number of index scan block I/Os didn't rise. Keep in mind that the column order of the index TOTAL_HISTORY_PK is of critical importance. If you put the column TS first, the number of block I/Os will skyrocket.
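For reference, here is the column order I mean, written as a sketch. I am assuming that total_history_pk is declared as the table's primary key; the exact definition is in the comment section. The equality columns (table_oid, pk_val) come first and the range column ts comes last, which is what lets the index scan above touch only a handful of blocks.

-- Sketch of an index column order consistent with the execution plan above.
-- Assumption: total_history_pk is the table's primary key; the real DDL is in the comments.
ALTER TABLE total_history
    ADD CONSTRAINT total_history_pk PRIMARY KEY (table_oid, pk_val, ts);

-- If ts is put first instead, e.g. PRIMARY KEY (ts, table_oid, pk_val),
-- the index scan has to read every entry earlier than the cut-off date,
-- and the number of block I/Os skyrockets.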
From this experiment, I came to the conclusion that the metadata-driven approach to handling historical data can meet the UX-defined time limits as long as we use the primary key value in the where clause.
Conclusion
If you always query against a table using primary key values in the where clause, the metadata-driven approach to storing historical data is acceptable in terms of performance.
The downside of the approach is as follows:
1. Even though we can reduce the number of tables for historical data dramatically, we need more disk space in the long run.
2. It requires a sophisticated index strategy and advanced-level query writers.
3. If you do not use a primary key value in the where clause, you will see an hourglass.
The advantage of the approach is as follows:
1. It is very flexible, so you don't have to do anything to the table for historical data when a column is added to or dropped from the base table.
Footnote
About 10 years ago I took a data modeling course offered by ENCORE. The data model features presented at the start of this note were provided in that course. The instructor's name was Jang HeeSik. I have rarely used the metadata-driven model in my career.
Addendum
The TO-BE model is based on the business requirement that the primary key value is never modified.
Added on April 9
I am afraid the conclusion I made was not reasonable; it was very hasty. I did not take the insert operation into account. If there are many inserts into the base tables, the database system will not be able to handle the contention in the table TOTAL_HISTORY. Surely, tests with real application transactions will be able to tell whether the TO-BE model is feasible. I have a hunch that the metadata-driven model cannot take care of the contention incurred by insert operations.