Designing a good, future-proof data model is one of the most challenging problems in building applications, especially applications that store historical data. There are two common approaches to storing historical data: "time period modeling" and "audit modeling". Choosing the more appropriate one requires careful analysis.
Recently, I was tasked with designing a table to store historical data regarding the assignment of stores to specific operational organizations.
The business requirements are as follows:
- Stores can have their operational organization changed at most once a day (see the index sketch after this list).
- The effective date of the operational organization can vary for each store.
- Each store must belong to at least one operational organization.
- An operational organization may not have any assigned stores.
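Incidentally, the once-a-day rule can be enforced declaratively. A minimal sketch, assuming the audit table STR_OPER_ORG defined later in this note (the index name is mine):
create unique index STR_OPER_ORG_DAY_UK on STR_OPER_ORG(STORE_ID, APP_DT); -- at most one recorded change per store per day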
I reviewed the following ERD to determine which approach, "time period modeling" or "audit modeling", is more suitable.
In the ERD, I deliberately designed two association entities: one tailored for audit modeling and the other for time period modeling.
In the remainder of this note, I will assess which model best fits the business requirements outlined above.
At first glance, time period modeling appears less viable: PostgreSQL implements an UPDATE by inserting a new row version, so closing a period rewrites the row and adds to the vacuuming burden. Let's generate some test data and run an experiment to verify this deduction.
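To make the write-path difference concrete, here is a minimal sketch of recording an organization change under each approach, using the tables defined below and hypothetical values:
-- Audit modeling: a single INSERT per change; no dead tuples are produced.
insert into STR_OPER_ORG (STORE_ID, OPER_ID, APP_DT, INSERT_ID)
values ('1', '20', '2023-08-11'::DATE, 'demo');
-- Time period modeling: first close the open period. PostgreSQL executes
-- this UPDATE by writing a new row version, leaving a dead tuple for VACUUM.
update STR_OPER_ORG_TP
set APP_END_DT = '2023-08-10'::DATE
where STORE_ID = '1'
and APP_END_DT = '9999-12-31'::DATE;
-- ... then INSERT the new open period.
insert into STR_OPER_ORG_TP (STORE_ID, OPER_ID, APP_BEGIN_DT, APP_END_DT, INSERT_ID)
values ('1', '20', '2023-08-11'::DATE, '9999-12-31'::DATE, 'demo');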
drop table OPER_ORG;
create table OPER_ORG(
OPER_ID VARCHAR(10) not NULL,
OPER_NAME VARCHAR(100) not NULL,
OPER_SP_CD VARCHAR(1) not NULL,
INSERT_ID VARCHAR(100) not NULL
);
insert into oper_org
select i::text
, i::text||'_OPER_ORG_NM'
, case when mod(i,5) = 0 then 'T' else 'P' end
, REPEAT('O',32)
from generate_series(1,100) a(i);
create unique index oper_org_uk on oper_org(oper_id);
Here is a table designed using an audit modeling approach.
drop table STR_OPER_ORG;
create table STR_OPER_ORG(
STORE_ID VARCHAR(10) not NULL,
OPER_ID VARCHAR(10) not NULL,
APP_DT DATE not NULL,
INSERT_ID VARCHAR(100) not NULL
);
insert into str_oper_org
select i as store_id
, mod(i,100)+1 as oper_id
, '20230810'::DATE
, REPEAT('K',32)
from generate_series(1, 90000) a(i);
insert into str_oper_org
select i as store_id
, mod(i,100)+1 as oper_id
, '20230810'::DATE - mod(i,10)-2
, REPEAT('K',32)
from generate_series(1, 90000) a(i);
insert into str_oper_org
select i as store_id
, mod(i,100)+1 as oper_id
, '20230810'::DATE - 20
, REPEAT('S',32)
from generate_series(1, 90000) a(i);
create unique index str_oper_org_uk on str_oper_org(oper_id, app_dt, store_id);
create unique index STR_OPER_OG_UK2 on STR_OPER_ORG(OPER_ID, STORE_ID, APP_DT);
Two composite unique indexes are created on purpose: STR_OPER_ORG_UK, ordered by (OPER_ID, APP_DT, STORE_ID), serves the GROUP BY query shown later, while STR_OPER_OG_UK2, ordered by (OPER_ID, STORE_ID, APP_DT), supports the DISTINCT ON scan.
Here is a table designed using a time period modeling approach.
create table STR_OPER_ORG_TP(
STORE_ID VARCHAR(10) not NULL,
OPER_ID VARCHAR(10) not NULL,
APP_BEGIN_DT DATE not NULL,
APP_END_DT DATE not null default '9999-12-31'::DATE,
INSERT_ID VARCHAR(100) not NULL
);
-- Derive each period's end date with a window function: ordered by APP_DT
-- descending, LAG returns the next (later) change date for the same store and
-- organization, so that date minus one day closes the period; the most recent
-- row has no later change and gets the open-ended 9999-12-31.
insert into STR_OPER_ORG_TP
select STORE_ID, OPER_ID, APP_DT
, coalesce(LAG(APP_DT-1) over (partition by STORE_ID, OPER_ID order by APP_DT DESC),'99991231'::DATE)
, INSERT_ID
from STR_OPER_ORG;
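A quick sanity check on the conversion (my addition, not part of the original script): every store should end up with exactly one open-ended period.
-- Expect 90000: one open period per store.
select count(*) from STR_OPER_ORG_TP where APP_END_DT = '9999-12-31'::DATE;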
create index STR_OPER_ORG_TP_UK on STR_OPER_ORG_TP(OPER_ID, APP_END_DT, APP_BEGIN_DT, STORE_ID);
drop index STR_OPER_ORG_TP_UK2;
create index STR_OPER_ORG_TP_UK2 on STR_OPER_ORG_TP(APP_END_DT, APP_BEGIN_DT, OPER_ID, STORE_ID);
drop table STORE;
create table STORE(
STORE_ID VARCHAR(10) not NULL,
STORE_NM VARCHAR(100) not null,
INSERT_ID VARCHAR(200) not NULL
);
insert into store
select i::text
, i::text||'_store_name'
, REPEAT('S',200)
from generate_series(1, 90000) a(i);
select * from store limit 100;
create unique index store_uk on store(store_id);
analyze STORE, STR_OPER_ORG, STR_OPER_ORG_TP, OPER_ORG;
select pg_relation_size('oper_org') as "oper_org"
, pg_relation_size('str_oper_org') as "audit history"
, pg_relation_size('str_oper_org_tp') as "time period history"
, pg_relation_size('store') as "store";
oper_org | audit history | time period history | store
---------+---------------+---------------------+---------
   16384 |      22372352 |            22806528 | 23781376
Let's assume that we need to extract the names of the stores assigned to organization ID 10 as of August 1, 2023.
If we have an association table following the "audit modeling" technique, we can construct a query as follows:
select C.OPER_NAME, A.STORE_NM
from OPER_ORG C
, (select
distinct on (STORE_ID) OPER_ID, STORE_ID, APP_DT
from STR_OPER_ORG
where OPER_ID = '10'
and APP_DT <= '20230801'::DATE
order by STORE_ID desc, APP_DT desc
) b
, STORE A
where C.OPER_ID = B.OPER_ID
and B.STORE_ID = A.STORE_ID
and C.OPER_ID = '10'
;
Here is the execution plan:
Nested Loop (actual time=0.037..2.333 rows=900 loops=1)
Buffers: shared hit=2716
-> Nested Loop (actual time=0.031..0.627 rows=900 loops=1)
Buffers: shared hit=16
-> Index Scan using oper_org_uk on oper_org c (actual time=0.012..0.013 rows=1 loops=1)
Index Cond: ((oper_id)::text = '10'::text)
Buffers: shared hit=2
-> Subquery Scan on b (actual time=0.016..0.540 rows=900 loops=1)
Filter: ((b.oper_id)::text = '10'::text)
Buffers: shared hit=14
-> Unique (actual time=0.015..0.454 rows=900 loops=1)
Buffers: shared hit=14
-> Index Only Scan Backward using str_oper_og_uk2 on str_oper_org (actual time=0.014..0.277 rows=1800 loops=1)
Index Cond: ((oper_id = '10'::text) AND (app_dt <= '2023-08-01'::date))
Heap Fetches: 0
Buffers: shared hit=14
-> Index Scan using store_uk on store a (actual time=0.002..0.002 rows=1 loops=900)
Index Cond: ((store_id)::text = (b.store_id)::text)
Buffers: shared hit=2700
Planning Time: 0.237 ms
Execution Time: 2.400 ms
If you are not familiar with the distinct on clause (it keeps only the first row of each group of rows sharing the listed expressions, as determined by the order by), you can write the query as follows:
select C.OPER_NAME, A.STORE_NM
from OPER_ORG C
, (select MAX(OPER_ID) as oper_id, STORE_ID, MAX(APP_DT) as APP_DT
from STR_OPER_ORG
where OPER_ID = '10'
and APP_DT <= '20230801'::DATE
group by STORE_ID
) b
, STORE A
where C.OPER_ID = B.OPER_ID
and B.STORE_ID = A.STORE_ID
and C.OPER_ID = '10'
;
Nested Loop (actual time=0.427..2.617 rows=900 loops=1)
Buffers: shared hit=2713
-> Nested Loop (actual time=0.420..0.619 rows=900 loops=1)
Buffers: shared hit=13
-> Index Scan using oper_org_uk on oper_org c (actual time=0.013..0.014 rows=1 loops=1)
Index Cond: ((oper_id)::text = '10'::text)
Buffers: shared hit=2
-> HashAggregate (actual time=0.406..0.540 rows=900 loops=1)
Group Key: str_oper_org.store_id
Filter: (max((str_oper_org.oper_id)::text) = '10'::text)
Batches: 1 Memory Usage: 193kB
Buffers: shared hit=11
-> Index Only Scan using str_oper_org_uk on str_oper_org (actual time=0.014..0.183 rows=1800 loops=1)
Index Cond: ((oper_id = '10'::text) AND (app_dt <= '2023-08-01'::date))
Heap Fetches: 0
Buffers: shared hit=11
-> Index Scan using store_uk on store a (actual time=0.002..0.002 rows=1 loops=900)
Index Cond: ((store_id)::text = (str_oper_org.store_id)::text)
Buffers: shared hit=2700
Planning Time: 0.197 ms
Execution Time: 2.684 ms
If we have an association table following the "time period modeling" technique, we can write a query like this:
select C.OPER_NAME, A.STORE_NM
from OPER_ORG C
, STR_OPER_ORG_TP B
, STORE A
where C.OPER_ID = '10'
and C.OPER_ID = B.OPER_ID
and '2023-08-01'::DATE between B.APP_BEGIN_DT and B.APP_END_DT
and B.STORE_ID = A.STORE_ID;
And here is the execution plan:
Nested Loop (actual time=0.022..1.976 rows=900 loops=1)
Buffers: shared hit=2715
-> Nested Loop (actual time=0.019..0.224 rows=900 loops=1)
Buffers: shared hit=15
-> Index Scan using oper_org_uk on oper_org c (actual time=0.006..0.007 rows=1 loops=1)
Index Cond: ((oper_id)::text = '10'::text)
Buffers: shared hit=2
-> Index Only Scan using str_oper_org_tp_uk on str_oper_org_tp b (actual time=0.011..0.148 rows=900 loops=1)
Index Cond: ((oper_id = '10'::text) AND (app_end_dt >= '2023-08-01'::date) AND (app_begin_dt <= '2023-08-01'::date))
Heap Fetches: 0
Buffers: shared hit=13
-> Index Scan using store_uk on store a (actual time=0.002..0.002 rows=1 loops=900)
Index Cond: ((store_id)::text = (b.store_id)::text)
Buffers: shared hit=2700
Planning:
Buffers: shared hit=6
Planning Time: 0.185 ms
Execution Time: 2.039 ms
Note that both the number of block I/Os and the elapsed time decreased slightly. More importantly, comparing the index accesses on STR_OPER_ORG and STR_OPER_ORG_TP, the number of rows read drops from 1800 to 900. As historical data accumulates in the association table, the performance of the audit modeling query will keep degrading, whereas the time period modeling query will maintain consistent performance.
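You can verify the degradation yourself with a sketch like the following (not executed in this note); every additional batch of past changes widens the audit query's scan range, since it must read every row with APP_DT on or before the target date:
-- One more batch of older history (filler values are arbitrary).
insert into STR_OPER_ORG
select i::text
, mod(i,100)+1
, '20230810'::DATE - 40
, REPEAT('X',32)
from generate_series(1, 90000) a(i);
-- Rebuild STR_OPER_ORG_TP with the same LAG query as above and re-run both
-- retrieval queries: the audit plan should now read 2700 rows instead of
-- 1800, while the time period plan still returns 900 rows per organization.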
The following table summarizes the strengths and weaknesses of both audit modeling and time period modeling.
|                      | ADVANTAGES | DISADVANTAGES |
|----------------------|------------|---------------|
| audit modeling       | prevents data duplication; takes up less storage | requires advanced SQL skills to retrieve data; poor retrieval performance |
| time period modeling | constructing a query is easy; good retrieval performance | the APP_END_DT column is a derived attribute; time periods can overlap; high chance of table bloat; bigger index/table size |
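The overlap risk listed above is worth addressing: nothing in the schema prevents two periods of the same store from overlapping. One way to close that gap in PostgreSQL, sketched here as an addition on top of the original design, is an exclusion constraint:
-- btree_gist lets a GiST index handle plain equality on STORE_ID.
create extension if not exists btree_gist;
-- Reject any two rows for the same store whose date ranges overlap.
alter table STR_OPER_ORG_TP
add constraint STR_OPER_ORG_TP_NO_OVERLAP
exclude using gist (
STORE_ID with =,
daterange(APP_BEGIN_DT, APP_END_DT, '[]') with &&
);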
Conclusion
Given the particular business requirements described at the beginning of this note, time period modeling is more favorable than audit modeling.
Footnote
To refine the conclusion above, I conducted an additional experiment. Let's suppose that we need to retrieve the organization names of all the stores.
If we have an association table following the "audit modeling" technique, we can formulate a query as follows:
select A.STORE_ID, A.STORE_NM, C.OPER_NAME
from OPER_ORG C
, (select
distinct on (OPER_ID, STORE_ID) OPER_ID, STORE_ID, APP_DT
from str_oper_org
where APP_DT <= '20230801'::DATE
order by OPER_ID desc, STORE_ID desc, APP_DT DESC
) b
, STORE A
where C.OPER_ID = B.OPER_ID
and B.STORE_ID = A.STORE_ID
;
Hash Join (actual time=18.826..80.352 rows=90000 loops=1)
Hash Cond: ((str_oper_org.store_id)::text = (a.store_id)::text)
Buffers: shared hit=3943
-> Hash Join (actual time=0.061..38.456 rows=90000 loops=1)
Hash Cond: ((str_oper_org.oper_id)::text = (c.oper_id)::text)
Buffers: shared hit=1040
-> Unique (actual time=0.021..25.142 rows=90000 loops=1)
Buffers: shared hit=1038
-> Index Only Scan Backward using str_oper_og_uk2 on str_oper_org (actual time=0.020..12.649 rows=117000 loops=1)
Index Cond: (app_dt <= '2023-08-01'::date)
Heap Fetches: 0
Buffers: shared hit=1038
-> Hash (actual time=0.028..0.029 rows=100 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 13kB
Buffers: shared hit=2
-> Seq Scan on oper_org c (actual time=0.010..0.018 rows=100 loops=1)
Buffers: shared hit=2
-> Hash (actual time=18.690..18.690 rows=90000 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 5837kB
Buffers: shared hit=2903
-> Seq Scan on store a (actual time=0.005..8.637 rows=90000 loops=1)
Buffers: shared hit=2903
Planning Time: 0.320 ms
Execution Time: 82.876 ms
Again, if you would rather avoid distinct on, you can write the query as follows:
select A.STORE_ID, A.STORE_NM, C.OPER_NAME
from OPER_ORG C
, (select OPER_ID, STORE_ID, max(APP_DT)
from str_oper_org
where APP_DT <= '20230801'::DATE
group by oper_id, store_id
) b
, STORE A
where C.OPER_ID = B.OPER_ID
and B.STORE_ID = A.STORE_ID
;
Hash Join (actual time=17.629..89.303 rows=90000 loops=1)
Hash Cond: ((str_oper_org.store_id)::text = (a.store_id)::text)
Buffers: shared hit=3943
-> Hash Join (actual time=0.044..49.968 rows=90000 loops=1)
Hash Cond: ((str_oper_org.oper_id)::text = (c.oper_id)::text)
Buffers: shared hit=1040
-> GroupAggregate (actual time=0.018..36.549 rows=90000 loops=1)
Group Key: str_oper_org.oper_id, str_oper_org.store_id
Buffers: shared hit=1038
-> Index Only Scan using str_oper_og_uk2 on str_oper_org (actual time=0.013..13.874 rows=117000 loops=1)
Index Cond: (app_dt <= '2023-08-01'::date)
Heap Fetches: 0
Buffers: shared hit=1038
-> Hash (actual time=0.022..0.023 rows=100 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 13kB
Buffers: shared hit=2
-> Seq Scan on oper_org c (actual time=0.005..0.013 rows=100 loops=1)
Buffers: shared hit=2
-> Hash (actual time=17.493..17.494 rows=90000 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 5837kB
Buffers: shared hit=2903
-> Seq Scan on store a (actual time=0.005..8.183 rows=90000 loops=1)
Buffers: shared hit=2903
Planning:
Buffers: shared hit=1
Planning Time: 0.180 ms
Execution Time: 92.257 ms
If we have an association table following the "time period modeling" technique, we can construct a query as follows:select C.OPER_NAME, A.STORE_NM
from OPER_ORG C
, STR_OPER_ORG_TP B
, STORE A
where 1=1
and C.OPER_ID = B.OPER_ID
and '2023-08-01'::DATE between B.APP_BEGIN_DT and B.APP_END_DT
and B.STORE_ID = A.STORE_ID;
Hash Join (actual time=19.798..69.635 rows=90000 loops=1)
Hash Cond: ((b.store_id)::text = (a.store_id)::text)
Buffers: shared hit=4054
-> Hash Join (actual time=0.037..26.222 rows=90000 loops=1)
Hash Cond: ((b.oper_id)::text = (c.oper_id)::text)
Buffers: shared hit=1151
-> Index Only Scan using str_oper_org_tp_uk2 on str_oper_org_tp b (actual time=0.011..12.242 rows=90000 loops=1)
Index Cond: ((app_end_dt >= '2023-08-01'::date) AND (app_begin_dt <= '2023-08-01'::date))
Heap Fetches: 0
Buffers: shared hit=1149
-> Hash (actual time=0.021..0.022 rows=100 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 13kB
Buffers: shared hit=2
-> Seq Scan on oper_org c (actual time=0.004..0.011 rows=100 loops=1)
Buffers: shared hit=2
-> Hash (actual time=19.694..19.694 rows=90000 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 5837kB
Buffers: shared hit=2903
-> Seq Scan on store a (actual time=0.003..8.881 rows=90000 loops=1)
Buffers: shared hit=2903
Planning:
Buffers: shared hit=14
Planning Time: 0.273 ms
Execution Time: 72.798 ms
Note that while the number of block I/Os increased from 3943 to 4054, the elapsed time decreased from about 83 ms to 73 ms. The rise in block I/O is a result of the larger index in time period modeling.
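You can confirm the index-size difference directly with pg_relation_size, which also works on indexes (output not captured here):
select pg_relation_size('str_oper_og_uk2') as audit_index_size
, pg_relation_size('str_oper_org_tp_uk2') as time_period_index_size;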
In any case, this experiment confirms that time period modeling is superior to audit modeling for this specific business requirement.