Shrinking the Join Set: Outsmarting AI with Business Knowledge
글쓴이 :
모델광 날짜 : 2026-06-18 (목) 18:11
조회 : 3
In the past SQL tuning was a task peformed by professionals. But with the rise of generative AI, SQL tuning has become a task performed by almost anyone. In most cases, an AI assistant outpeforms a professional sql tuner.
However, there is a class of database optimization where AI still falls short: tuning driven by deep business logic. When a developer truly understands the relationshop between data elements, they can leverage optimization strategies that LLMs simply can not see.
In this article, we will look at how to use transitive predicate generation to drastically reduce the join set and outperform generic AI optimization.
Let's start by creating a realistic test environment. We will simulate an operational schema with online_orders, offline_orders, and a shared ord_items table.
-- Create and populate Online Orders DROP TABLE IF EXISTS online_order CASCADE; CREATE TABLE online_order ( ord_no bigint NOT NULL, cust_no int NOT NULL, ord_date date NOT NULL, ord_dt varchar(8) NOT NULL, ord_status_cd varchar(1) NOT NULL, comment varchar(100), CONSTRAINT online_order_pk PRIMARY KEY (ord_no) );
INSERT INTO online_order SELECT i AS ord_no, MOD(i, 1000000) AS cust_no, '2022-01-20'::date + ((i - 1) / 2000)::int AS ord_date, TO_CHAR('2022-01-20'::date + ((i - 1) / 2000)::int, 'YYYYMMDD') AS ord_dt, (MOD(i, 4) + 1)::varchar(1) AS ord_status_cd, LPAD('x', 100, 'x') AS comment FROM generate_series(1, 2000000, 2) AS a(i);
-- 2. Create and populate Offline Orders DROP TABLE IF EXISTS offline_order CASCADE; CREATE TABLE offline_order ( ord_no bigint NOT NULL, cust_no int NOT NULL, ord_date date NOT NULL, ord_dt varchar(8) NOT NULL, ord_status_cd varchar(1) NOT NULL, empno int, comment varchar(100), CONSTRAINT offline_order_pk PRIMARY KEY (ord_no) );
INSERT INTO offline_order SELECT i AS ord_no, MOD(i, 1000000) AS cust_no, '2022-01-20'::date + ((i - 1) / 2000)::int AS ord_date, TO_CHAR('2022-01-20'::date + ((i - 1) / 2000)::int, 'YYYYMMDD') AS ord_dt, (MOD(i/2, 4) + 1) AS ord_status_cd, MOD(i, 10000) + 1 AS empno, LPAD('y', 100, 'y') FROM generate_series(2, 2000000, 2) AS a(i);
-- 3. Create and populate Order Items (The large fact table) DROP TABLE IF EXISTS ord_item CASCADE; CREATE TABLE ord_item ( ord_no bigint NOT NULL, prod_id varchar(10) NOT NULL, unit_price int NOT NULL, quantity int NOT NULL, on_off_code varchar(2) NOT NULL, order_comment varchar(100), CONSTRAINT ord_item_pk PRIMARY KEY (ord_no, prod_id) );
INSERT INTO ord_item SELECT a.ord_no, 'prod'||(MOD(ord_no, 200)+1)::text AS prod_id, TRUNC(100*RANDOM())+1 AS unit_price, TRUNC(10*RANDOM())+1 AS quantity, CASE WHEN MOD(ord_no, 2)=0 THEN '01' ELSE '02' END AS on_off_code, LPAD('c', 100, 'y') FROM (SELECT ord_no FROM online_order UNION ALL SELECT ord_no FROM offline_order) a;
INSERT INTO ord_item SELECT a.ord_no, 'prod'||(MOD(ord_no, 200)+2)::text AS prod_id, TRUNC(100*RANDOM())+1 AS unit_price, TRUNC(10*RANDOM())+1 AS quantity, CASE WHEN MOD(ord_no, 2)=0 THEN '01' ELSE '02' END AS on_off_code, LPAD('d', 100, 'q') FROM (SELECT ord_no FROM online_order UNION ALL SELECT ord_no FROM offline_order) a;
ANALYZE ord_item, online_order, offline_order;
Suppose we need to analyze the monthly quantity sold for each product within a specific date range. Here is the query we need to investigate:
SELECT TO_CHAR(a.ord_date, 'yyyymm') AS ord_yyyymm, b.prod_id, SUM(b.quantity) AS ord_qty_sum FROM online_order a , ord_item b WHERE a.ord_no = b.ord_no AND a.ord_date BETWEEN '2023-01-20'::date AND '2023-12-20'::date
GROUP BY TO_CHAR(a.ord_date, 'yyyymm'), b.prod_id;
If you feed this query and plan into an AI assistant, it will typically guess at generic fixes - suggesing composite indexes, included columns, or modifications to work_mem. It can not catch the core architectural problem.
Diagnosing the Real Problem
Let's take a close look at the execution metrics: - Building the hash table for online_order takes only 137 ms. - Scanning the ord_item table takes 416 ms. - However, the Hash Join node devours the bulk of the time (1709 ms), spilling data to temporary disk space (temp written = 17222)
Why? Because the engine scans all 4,000,000 rows of ord_item, only for the Hash Join to discard the vast majority, keeping just 670,000 rows. If we can prune the rows from ord_item before they reach the join phase, we can drastically accelerate the query.
The Power of Busines Logic: Transitive Pruning
To fix this, we apply a critical piece of business domain knowledge: Order numbers(ord_no) are generated sequentially over time. This means ord_no and ord_date have a strict physical correlation.
Because ord_no directly reflects the progression of time, a specific date range maps to a predictable range of primary keys of ord_no. By finding the MIN and MAX order numbers for our target date range, we can inject a highly restrictive bounds condition onto the ord_item table.
Here is the rewritten query to pre-calculate the key boundaries:
WITH date_range AS ( SELECT MIN(ord_no) AS min_no, MAX(ord_no) AS max_no FROM online_order WHERE ord_date BETWEEN '2023-01-20'::date AND '2023-12-20'::date ) SELECT TO_CHAR(a.ord_date, 'yyyymm') AS ord_yyyymm, b.prod_id, SUM(b.quantity) AS ord_qty_sum FROM online_order a , ord_item b , date_range r WHERE a.ord_no = b.ord_no AND b.ord_no BETWEEN r.min_no AND r.max_no -- Drastically restricts the probing set AND a.ord_date BETWEEN '2023-01-20'::date AND '2023-12-20'::date
GROUP BY TO_CHAR(a.ord_date, 'yyyymm'), b.prod_id;
Let's look at how the planner responds to this structural adjustment:
HashAggregate (actual time=1210.740..1211.105 rows=2400.00 loops=1) Group Key: to_char((a.ord_date)::timestamp with time zone, 'yyyymm'::text), b.prod_id Batches: 1 Memory Usage: 1177kB Buffers: shared hit=9765 read=54989, temp read=6507 written=6507 -> Hash Join (actual time=287.754..1092.058 rows=670000.00 loops=1) Hash Cond: (b.ord_no = a.ord_no) Buffers: shared hit=9765 read=54989, temp read=6507 written=6507 -> Nested Loop (actual time=168.697..428.238 rows=1339998.00 loops=1) Buffers: shared hit=8 read=45515 -> Result (actual time=114.717..114.720 rows=1.00 loops=1) Buffers: shared hit=7 read=14606 InitPlan 1 -> Limit (actual time=55.844..55.845 rows=1.00 loops=1) Buffers: shared hit=4 read=8016 -> Index Scan using online_order_pk on online_order (actual time=55.835..55.835 rows=1.00 loops=1) Filter: ((ord_date >= '2023-01-20'::date) AND (ord_date <= '2023-12-20'::date)) Rows Removed by Filter: 365000 Index Searches: 1 Buffers: shared hit=4 read=8016 InitPlan 2 -> Limit (actual time=48.325..48.326 rows=1.00 loops=1) Buffers: shared hit=3 read=6590 -> Index Scan Backward using online_order_pk on online_order online_order_1 (actual time=48.321..48.322 rows=1.00 loops=1) Filter: ((ord_date >= '2023-01-20'::date) AND (ord_date <= '2023-12-20'::date)) Rows Removed by Filter: 300000 Index Searches: 1 Buffers: shared hit=3 read=6590 -> Bitmap Heap Scan on ord_item b (actual time=53.954..197.047 rows=1339998.00 loops=1) Recheck Cond: ((ord_no >= ((InitPlan 1).col1)) AND (ord_no <= ((InitPlan 2).col1))) Heap Blocks: exact=25773 Buffers: shared hit=1 read=30909 -> Bitmap Index Scan on ord_item_pk (actual time=50.300..50.300 rows=1339998.00 loops=1) Index Cond: ((ord_no >= ((InitPlan 1).col1)) AND (ord_no <= ((InitPlan 2).col1))) Index Searches: 1 Buffers: shared hit=1 read=5136 -> Hash (actual time=118.786..118.787 rows=335000.00 loops=1) Buckets: 262144 Batches: 4 Memory Usage: 5965kB Buffers: shared hit=9757 read=9474, temp written=1104 -> Seq Scan on online_order a (actual time=21.236..72.270 rows=335000.00 loops=1) Filter: ((ord_date >= '2023-01-20'::date) AND (ord_date <= '2023-12-20'::date)) Rows Removed by Filter: 665000 Buffers: shared hit=9757 read=9474 Planning: Buffers: shared hit=5 read=11 Planning Time: 6.563 ms
Execution Time: 1212.207 ms
By mapping out the key boundaries via InitPlan1 and InitPlan2, the database shifts away from an expensive full-table scan on ord_item.
Instead, it utilizes a highly efficient Bitmap Index Scan, shrinking the dataset passing into the join from 4,000,000 down to just 1,339,998 rows.
AS a result: - Total elapsed time drops from 1834 ms to 1212 ms (a ~34% performance gain). - The time spent inside the Hash Join node drops from 1709 ms to 1092 ms due to reduced hashing memory overhead.
Conclusion Cost-based optimizers judge queries using mathematical models and standard data distribution statistics. They do not understand the chronological real-world meaning behind your columns. This optimizatioin technique proves that a human developer who understands both database internals and the underlying business rules can easily out-tune an AI. Until AI can intuitively map business realities to data relationships, this kind of SQL tuning remains a uniquely human craft.