설문조사
PostgreSQL/PPAS 관련 듣고 싶은 교육은


총 게시물 182건, 최근 1 건
 

Shrinking the Join Set: Outsmarting AI with Business Knowledge

글쓴이 : 모델광 날짜 : 2026-06-18 (목) 18:11 조회 : 3
In the past SQL tuning was a task peformed by professionals. But with the rise of generative AI, SQL tuning has become a task performed by almost anyone. In most cases, an AI assistant outpeforms a professional sql tuner.

However, there is a class of database optimization where AI still falls short: tuning driven by deep business logic. When a developer truly understands the relationshop between data elements, they can leverage optimization strategies that LLMs simply can not see.​

In this article, we will look at how to use transitive predicate generation to drastically reduce the join set and outperform generic AI optimization.

Let's start by creating a realistic test environment. We will simulate an operational schema with online_orders, offline_orders, and a shared ord_items table.

-- Create and populate Online Orders
DROP TABLE IF EXISTS online_order CASCADE;
CREATE TABLE online_order (
    ord_no        bigint       NOT NULL,
    cust_no       int          NOT NULL,
    ord_date      date         NOT NULL,
    ord_dt        varchar(8)   NOT NULL,
    ord_status_cd varchar(1)   NOT NULL,
    comment       varchar(100),
    CONSTRAINT online_order_pk PRIMARY KEY (ord_no)
);

INSERT INTO online_order
SELECT 
    i AS ord_no,
    MOD(i, 1000000) AS cust_no,
    '2022-01-20'::date + ((i - 1) / 2000)::int AS ord_date,
    TO_CHAR('2022-01-20'::date + ((i - 1) / 2000)::int, 'YYYYMMDD') AS ord_dt,
    (MOD(i, 4) + 1)::varchar(1) AS ord_status_cd,
    LPAD('x', 100, 'x') AS comment
FROM generate_series(1, 2000000, 2) AS a(i);

-- 2. Create and populate Offline Orders
DROP TABLE IF EXISTS offline_order CASCADE;
CREATE TABLE offline_order (
    ord_no        bigint       NOT NULL,
    cust_no       int          NOT NULL,
    ord_date      date         NOT NULL,
    ord_dt        varchar(8)   NOT NULL,
    ord_status_cd varchar(1)   NOT NULL,
    empno         int,
    comment       varchar(100),
    CONSTRAINT offline_order_pk PRIMARY KEY (ord_no)
);

INSERT INTO offline_order
SELECT 
    i AS ord_no,
    MOD(i, 1000000) AS cust_no,
    '2022-01-20'::date + ((i - 1) / 2000)::int AS ord_date,
    TO_CHAR('2022-01-20'::date + ((i - 1) / 2000)::int, 'YYYYMMDD') AS ord_dt,
    (MOD(i/2, 4) + 1) AS ord_status_cd,
    MOD(i, 10000) + 1 AS empno,
    LPAD('y', 100, 'y')
FROM generate_series(2, 2000000, 2) AS a(i);

-- 3. Create and populate Order Items (The large fact table)
DROP TABLE IF EXISTS ord_item CASCADE;
CREATE TABLE ord_item (
    ord_no       bigint      NOT NULL,
    prod_id      varchar(10) NOT NULL,
    unit_price   int         NOT NULL,
    quantity     int         NOT NULL,
    on_off_code  varchar(2)  NOT NULL,
    order_comment varchar(100),
    CONSTRAINT ord_item_pk PRIMARY KEY (ord_no, prod_id)
);

INSERT INTO ord_item
SELECT a.ord_no, 'prod'||(MOD(ord_no, 200)+1)::text AS prod_id,
       TRUNC(100*RANDOM())+1 AS unit_price,
       TRUNC(10*RANDOM())+1 AS quantity,
       CASE WHEN MOD(ord_no, 2)=0 THEN '01' ELSE '02' END AS on_off_code,
       LPAD('c', 100, 'y')
FROM (SELECT ord_no FROM online_order UNION ALL SELECT ord_no FROM offline_order) a;

INSERT INTO ord_item
SELECT a.ord_no, 'prod'||(MOD(ord_no, 200)+2)::text AS prod_id,
       TRUNC(100*RANDOM())+1 AS unit_price,
       TRUNC(10*RANDOM())+1 AS quantity,
       CASE WHEN MOD(ord_no, 2)=0 THEN '01' ELSE '02' END AS on_off_code,
       LPAD('d', 100, 'q')
FROM (SELECT ord_no FROM online_order UNION ALL SELECT ord_no FROM offline_order) a;

ANALYZE ord_item, online_order, offline_order;

Suppose we need to analyze the monthly quantity sold for each product within a specific date range. Here is the query we need to investigate:

SELECT TO_CHAR(a.ord_date, 'yyyymm') AS ord_yyyymm, 
       b.prod_id, 
       SUM(b.quantity) AS ord_qty_sum
  FROM online_order a 
     , ord_item b 
 WHERE a.ord_no = b.ord_no 
   AND a.ord_date BETWEEN '2023-01-20'::date AND '2023-12-20'::date 
 GROUP BY TO_CHAR(a.ord_date, 'yyyymm'), b.prod_id;

And here is the execution plan:

HashAggregate (actual time=1833.266..1833.684 rows=2400.00 loops=1)
  Group Key: to_char((a.ord_date)::timestamp with time zone, 'yyyymm'::text), b.prod_id
  Batches: 1  Memory Usage: 1177kB
  Buffers: shared hit=16119 read=80036, temp read=17222 written=17222
  ->  Hash Join (actual time=240.259..1709.837 rows=670000.00 loops=1)
        Hash Cond: (b.ord_no = a.ord_no)
        Buffers: shared hit=16119 read=80036, temp read=17222 written=17222
        ->  Seq Scan on ord_item b (actual time=6.528..416.188 rows=4000000.00 loops=1)
              Buffers: shared hit=15675 read=61249
        ->  Hash (actual time=137.279..137.280 rows=335000.00 loops=1)
              Buckets: 262144  Batches: 4  Memory Usage: 5965kB
              Buffers: shared hit=444 read=18787, temp written=1104
              ->  Seq Scan on online_order a (actual time=29.125..87.647 rows=335000.00 loops=1)
                    Filter: ((ord_date >= '2023-01-20'::date) AND (ord_date <= '2023-12-20'::date))
                    Rows Removed by Filter: 665000
                    Buffers: shared hit=444 read=18787
Execution Time: 1834.589 ms

If you feed this query and plan into an AI assistant, it will typically guess at generic fixes - suggesing composite indexes, included columns, or modifications to work_mem. It can not catch the core architectural problem.

Diagnosing the Real Problem

Let's take a close look at the execution metrics:
 - Building the hash table for online_order takes only 137 ms.
 - Scanning the ord_item table takes 416 ms.
 - However, the Hash Join node devours the bulk of the time (1709 ms), spilling data to temporary disk space (temp written = 17222)

Why? Because the engine scans all 4,000,000 rows of ord_item, only for the Hash Join to discard the vast majority, keeping just 670,000 rows. If we can prune the rows from ord_item before they reach the join phase, we can drastically accelerate the query.

The Power of Busines Logic: Transitive Pruning

To fix this, we apply a critical piece of business domain knowledge: Order numbers(ord_no) are generated sequentially over time. This means ord_no and ord_date have a strict physical correlation.

Because ord_no directly reflects the progression of time, a specific date range maps to a predictable range of primary keys of ord_no. By finding the MIN and MAX order numbers for our target date range, we can inject a highly restrictive bounds condition onto the ord_item table.

Here is the rewritten query to pre-calculate the key boundaries:

WITH date_range AS (
    SELECT MIN(ord_no) AS min_no, 
           MAX(ord_no) AS max_no
      FROM online_order
     WHERE ord_date BETWEEN '2023-01-20'::date AND '2023-12-20'::date
)
SELECT TO_CHAR(a.ord_date, 'yyyymm') AS ord_yyyymm, 
       b.prod_id, 
       SUM(b.quantity) AS ord_qty_sum
  FROM online_order a
     , ord_item b
     , date_range r  
 WHERE a.ord_no = b.ord_no 
   AND b.ord_no BETWEEN r.min_no AND r.max_no        -- Drastically restricts the probing set
   AND a.ord_date BETWEEN '2023-01-20'::date AND '2023-12-20'::date 
 GROUP BY TO_CHAR(a.ord_date, 'yyyymm'), b.prod_id;

Let's look at how the planner responds to this structural adjustment:

HashAggregate (actual time=1210.740..1211.105 rows=2400.00 loops=1)
  Group Key: to_char((a.ord_date)::timestamp with time zone, 'yyyymm'::text), b.prod_id
  Batches: 1  Memory Usage: 1177kB
  Buffers: shared hit=9765 read=54989, temp read=6507 written=6507
  ->  Hash Join (actual time=287.754..1092.058 rows=670000.00 loops=1)
        Hash Cond: (b.ord_no = a.ord_no)
        Buffers: shared hit=9765 read=54989, temp read=6507 written=6507
        ->  Nested Loop (actual time=168.697..428.238 rows=1339998.00 loops=1)
              Buffers: shared hit=8 read=45515
              ->  Result (actual time=114.717..114.720 rows=1.00 loops=1)
                    Buffers: shared hit=7 read=14606
                    InitPlan 1
                      ->  Limit (actual time=55.844..55.845 rows=1.00 loops=1)
                            Buffers: shared hit=4 read=8016
                            ->  Index Scan using online_order_pk on online_order (actual time=55.835..55.835 rows=1.00 loops=1)
                                  Filter: ((ord_date >= '2023-01-20'::date) AND (ord_date <= '2023-12-20'::date))
                                  Rows Removed by Filter: 365000
                                  Index Searches: 1
                                  Buffers: shared hit=4 read=8016
                    InitPlan 2
                      ->  Limit (actual time=48.325..48.326 rows=1.00 loops=1)
                            Buffers: shared hit=3 read=6590
                            ->  Index Scan Backward using online_order_pk on online_order online_order_1 (actual time=48.321..48.322 rows=1.00 loops=1)
                                  Filter: ((ord_date >= '2023-01-20'::date) AND (ord_date <= '2023-12-20'::date))
                                  Rows Removed by Filter: 300000
                                  Index Searches: 1
                                  Buffers: shared hit=3 read=6590
              ->  Bitmap Heap Scan on ord_item b (actual time=53.954..197.047 rows=1339998.00 loops=1)
                    Recheck Cond: ((ord_no >= ((InitPlan 1).col1)) AND (ord_no <= ((InitPlan 2).col1)))
                    Heap Blocks: exact=25773
                    Buffers: shared hit=1 read=30909
                    ->  Bitmap Index Scan on ord_item_pk (actual time=50.300..50.300 rows=1339998.00 loops=1)
                          Index Cond: ((ord_no >= ((InitPlan 1).col1)) AND (ord_no <= ((InitPlan 2).col1)))
                          Index Searches: 1
                          Buffers: shared hit=1 read=5136
        ->  Hash (actual time=118.786..118.787 rows=335000.00 loops=1)
              Buckets: 262144  Batches: 4  Memory Usage: 5965kB
              Buffers: shared hit=9757 read=9474, temp written=1104
              ->  Seq Scan on online_order a (actual time=21.236..72.270 rows=335000.00 loops=1)
                    Filter: ((ord_date >= '2023-01-20'::date) AND (ord_date <= '2023-12-20'::date))
                    Rows Removed by Filter: 665000
                    Buffers: shared hit=9757 read=9474
Planning:
  Buffers: shared hit=5 read=11
Planning Time: 6.563 ms
Execution Time: 1212.207 ms

By mapping out the key boundaries via InitPlan1 and InitPlan2, the database shifts away from an expensive full-table scan on ord_item.

Instead, it utilizes a highly efficient Bitmap Index Scan, shrinking the dataset passing into the join from 4,000,000 down to just 1,339,998 rows.

AS a result:
 - Total elapsed time drops from 1834 ms to 1212 ms (a ~34% performance gain).
 - The time spent inside the Hash Join node drops from 1709 ms to 1092 ms due to reduced hashing memory overhead.

Conclusion
Cost-based optimizers judge queries using mathematical models and standard data distribution statistics. They do not understand the chronological real-world meaning behind your columns.
This optimizatioin technique proves that a human developer who understands both database internals and the underlying business rules can easily out-tune an AI. Until AI can intuitively map business realities to data relationships, this kind of SQL tuning remains a uniquely human craft.

 

postgresdba.com