PostgreSQL 은 특별합니다.
단순한 데이터베이스가 아니라, 하나의 어플리케이션 플랫폼이기 때문입니다.

공지사항

뉴스 및 기사

	총 게시물 183건, 최근 2 건

Shrinking the Join Set: Outsmarting AI with Business Knowledge

글쓴이 : 모델광 날짜 : 2026-06-18 (목) 18:11 조회 : 202

In the past SQL tuning was a task peformed by professionals. But with the rise of generative AI, SQL tuning has become a task performed by almost anyone. In most cases, an AI assistant outpeforms a professional sql tuner.

However, there is a class of database optimization where AI still falls short: tuning driven by deep business logic. When a developer truly understands the relationshop between data elements, they can leverage optimization strategies that LLMs simply can not see.

In this article, we will look at how to use transitive predicate generation to drastically reduce the join set and outperform generic AI optimization.

Let's start by creating a realistic test environment. We will simulate an operational schema with online_orders, offline_orders, and a shared ord_items table.

-- Create and populate Online Orders
DROP TABLE IF EXISTS online_order CASCADE;
CREATE TABLE online_order (
ord_no bigint NOT NULL,
cust_no int NOT NULL,
ord_date date NOT NULL,
ord_dt varchar(8) NOT NULL,
ord_status_cd varchar(1) NOT NULL,
comment varchar(100),
CONSTRAINT online_order_pk PRIMARY KEY (ord_no)
);

INSERT INTO online_order
SELECT
i AS ord_no,
MOD(i, 1000000) AS cust_no,
'2022-01-20'::date + ((i - 1) / 2000)::int AS ord_date,
TO_CHAR('2022-01-20'::date + ((i - 1) / 2000)::int, 'YYYYMMDD') AS ord_dt,
(MOD(i, 4) + 1)::varchar(1) AS ord_status_cd,
LPAD('x', 100, 'x') AS comment
FROM generate_series(1, 2000000, 2) AS a(i);

-- 2. Create and populate Offline Orders
DROP TABLE IF EXISTS offline_order CASCADE;
CREATE TABLE offline_order (
ord_no bigint NOT NULL,
cust_no int NOT NULL,
ord_date date NOT NULL,
ord_dt varchar(8) NOT NULL,
ord_status_cd varchar(1) NOT NULL,
empno int,
comment varchar(100),
CONSTRAINT offline_order_pk PRIMARY KEY (ord_no)
);

INSERT INTO offline_order
SELECT
i AS ord_no,
MOD(i, 1000000) AS cust_no,
'2022-01-20'::date + ((i - 1) / 2000)::int AS ord_date,
TO_CHAR('2022-01-20'::date + ((i - 1) / 2000)::int, 'YYYYMMDD') AS ord_dt,
(MOD(i/2, 4) + 1) AS ord_status_cd,
MOD(i, 10000) + 1 AS empno,
LPAD('y', 100, 'y')
FROM generate_series(2, 2000000, 2) AS a(i);

-- 3. Create and populate Order Items (The large fact table)
DROP TABLE IF EXISTS ord_item CASCADE;
CREATE TABLE ord_item (
ord_no bigint NOT NULL,
prod_id varchar(10) NOT NULL,
unit_price int NOT NULL,
quantity int NOT NULL,
on_off_code varchar(2) NOT NULL,
order_comment varchar(100),
CONSTRAINT ord_item_pk PRIMARY KEY (ord_no, prod_id)
);

INSERT INTO ord_item
SELECT a.ord_no, 'prod'||(MOD(ord_no, 200)+1)::text AS prod_id,
TRUNC(100*RANDOM())+1 AS unit_price,
TRUNC(10*RANDOM())+1 AS quantity,
CASE WHEN MOD(ord_no, 2)=0 THEN '01' ELSE '02' END AS on_off_code,
LPAD('c', 100, 'y')
FROM (SELECT ord_no FROM online_order UNION ALL SELECT ord_no FROM offline_order) a;

INSERT INTO ord_item
SELECT a.ord_no, 'prod'||(MOD(ord_no, 200)+2)::text AS prod_id,
TRUNC(100*RANDOM())+1 AS unit_price,
TRUNC(10*RANDOM())+1 AS quantity,
CASE WHEN MOD(ord_no, 2)=0 THEN '01' ELSE '02' END AS on_off_code,
LPAD('d', 100, 'q')
FROM (SELECT ord_no FROM online_order UNION ALL SELECT ord_no FROM offline_order) a;

ANALYZE ord_item, online_order, offline_order;

Suppose we need to analyze the monthly quantity sold for each product within a specific date range. Here is the query we need to investigate:

SELECT TO_CHAR(a.ord_date, 'yyyymm') AS ord_yyyymm,
b.prod_id,
SUM(b.quantity) AS ord_qty_sum
FROM online_order a
, ord_item b
WHERE a.ord_no = b.ord_no
AND a.ord_date BETWEEN '2023-01-20'::date AND '2023-12-20'::date

GROUP BY TO_CHAR(a.ord_date, 'yyyymm'), b.prod_id;

And here is the execution plan:

HashAggregate (actual time=1833.266..1833.684 rows=2400.00 loops=1)
Group Key: to_char((a.ord_date)::timestamp with time zone, 'yyyymm'::text), b.prod_id
Batches: 1 Memory Usage: 1177kB
Buffers: shared hit=16119 read=80036, temp read=17222 written=17222
-> Hash Join (actual time=240.259..1709.837 rows=670000.00 loops=1)
Hash Cond: (b.ord_no = a.ord_no)
Buffers: shared hit=16119 read=80036, temp read=17222 written=17222
-> Seq Scan on ord_item b (actual time=6.528..416.188 rows=4000000.00 loops=1)
Buffers: shared hit=15675 read=61249
-> Hash (actual time=137.279..137.280 rows=335000.00 loops=1)
Buckets: 262144 Batches: 4 Memory Usage: 5965kB
Buffers: shared hit=444 read=18787, temp written=1104
-> Seq Scan on online_order a (actual time=29.125..87.647 rows=335000.00 loops=1)
Filter: ((ord_date >= '2023-01-20'::date) AND (ord_date <= '2023-12-20'::date))
Rows Removed by Filter: 665000
Buffers: shared hit=444 read=18787

Execution Time: 1834.589 ms

If you feed this query and plan into an AI assistant, it will typically guess at generic fixes - suggesing composite indexes, included columns, or modifications to work_mem. It can not catch the core architectural problem.

Diagnosing the Real Problem

Let's take a close look at the execution metrics:
- Building the hash table for online_order takes only 137 ms.
- Scanning the ord_item table takes 416 ms.
- However, the Hash Join node devours the bulk of the time (1709 ms), spilling data to temporary disk space (temp written = 17222)

Why? Because the engine scans all 4,000,000 rows of ord_item, only for the Hash Join to discard the vast majority, keeping just 670,000 rows. If we can prune the rows from ord_item before they reach the join phase, we can drastically accelerate the query.

The Power of Busines Logic: Transitive Pruning

To fix this, we apply a critical piece of business domain knowledge: Order numbers(ord_no) are generated sequentially over time. This means ord_no and ord_date have a strict physical correlation.

Because ord_no directly reflects the progression of time, a specific date range maps to a predictable range of primary keys of ord_no. By finding the MIN and MAX order numbers for our target date range, we can inject a highly restrictive bounds condition onto the ord_item table.

Here is the rewritten query to pre-calculate the key boundaries:

WITH date_range AS (
SELECT MIN(ord_no) AS min_no,
MAX(ord_no) AS max_no
FROM online_order
WHERE ord_date BETWEEN '2023-01-20'::date AND '2023-12-20'::date
)
SELECT TO_CHAR(a.ord_date, 'yyyymm') AS ord_yyyymm,
b.prod_id,
SUM(b.quantity) AS ord_qty_sum
FROM online_order a
, ord_item b
, date_range r
WHERE a.ord_no = b.ord_no
AND b.ord_no BETWEEN r.min_no AND r.max_no -- Drastically restricts the probing set
AND a.ord_date BETWEEN '2023-01-20'::date AND '2023-12-20'::date

GROUP BY TO_CHAR(a.ord_date, 'yyyymm'), b.prod_id;

Let's look at how the planner responds to this structural adjustment:

HashAggregate (actual time=1210.740..1211.105 rows=2400.00 loops=1)
Group Key: to_char((a.ord_date)::timestamp with time zone, 'yyyymm'::text), b.prod_id
Batches: 1 Memory Usage: 1177kB
Buffers: shared hit=9765 read=54989, temp read=6507 written=6507
-> Hash Join (actual time=287.754..1092.058 rows=670000.00 loops=1)
Hash Cond: (b.ord_no = a.ord_no)
Buffers: shared hit=9765 read=54989, temp read=6507 written=6507
-> Nested Loop (actual time=168.697..428.238 rows=1339998.00 loops=1)
Buffers: shared hit=8 read=45515
-> Result (actual time=114.717..114.720 rows=1.00 loops=1)
Buffers: shared hit=7 read=14606
InitPlan 1
-> Limit (actual time=55.844..55.845 rows=1.00 loops=1)
Buffers: shared hit=4 read=8016
-> Index Scan using online_order_pk on online_order (actual time=55.835..55.835 rows=1.00 loops=1)
Filter: ((ord_date >= '2023-01-20'::date) AND (ord_date <= '2023-12-20'::date))
Rows Removed by Filter: 365000
Index Searches: 1
Buffers: shared hit=4 read=8016
InitPlan 2
-> Limit (actual time=48.325..48.326 rows=1.00 loops=1)
Buffers: shared hit=3 read=6590
-> Index Scan Backward using online_order_pk on online_order online_order_1 (actual time=48.321..48.322 rows=1.00 loops=1)
Filter: ((ord_date >= '2023-01-20'::date) AND (ord_date <= '2023-12-20'::date))
Rows Removed by Filter: 300000
Index Searches: 1
Buffers: shared hit=3 read=6590
-> Bitmap Heap Scan on ord_item b (actual time=53.954..197.047 rows=1339998.00 loops=1)
Recheck Cond: ((ord_no >= ((InitPlan 1).col1)) AND (ord_no <= ((InitPlan 2).col1)))
Heap Blocks: exact=25773
Buffers: shared hit=1 read=30909
-> Bitmap Index Scan on ord_item_pk (actual time=50.300..50.300 rows=1339998.00 loops=1)
Index Cond: ((ord_no >= ((InitPlan 1).col1)) AND (ord_no <= ((InitPlan 2).col1)))
Index Searches: 1
Buffers: shared hit=1 read=5136
-> Hash (actual time=118.786..118.787 rows=335000.00 loops=1)
Buckets: 262144 Batches: 4 Memory Usage: 5965kB
Buffers: shared hit=9757 read=9474, temp written=1104
-> Seq Scan on online_order a (actual time=21.236..72.270 rows=335000.00 loops=1)
Filter: ((ord_date >= '2023-01-20'::date) AND (ord_date <= '2023-12-20'::date))
Rows Removed by Filter: 665000
Buffers: shared hit=9757 read=9474
Planning:
Buffers: shared hit=5 read=11
Planning Time: 6.563 ms

Execution Time: 1212.207 ms

By mapping out the key boundaries via InitPlan1 and InitPlan2, the database shifts away from an expensive full-table scan on ord_item.

Instead, it utilizes a highly efficient Bitmap Index Scan, shrinking the dataset passing into the join from 4,000,000 down to just 1,339,998 rows.

AS a result:
- Total elapsed time drops from 1834 ms to 1212 ms (a ~34% performance gain).
- The time spent inside the Hash Join node drops from 1709 ms to 1092 ms due to reduced hashing memory overhead.

Conclusion
Cost-based optimizers judge queries using mathematical models and standard data distribution statistics. They do not understand the chronological real-world meaning behind your columns.
This optimizatioin technique proves that a human developer who understands both database internals and the underlying business rules can easily out-tune an AI. Until AI can intuitively map business realities to data relationships, this kind of SQL tuning remains a uniquely human craft.

E-MAIL: root@postgresdba.com