설문조사
PostgreSQL/PPAS 관련 듣고 싶은 교육은


총 게시물 125건, 최근 1 건
   

NOT IN vs. NOT EXISTS

글쓴이 : 모델광 날짜 : 2022-07-31 (일) 20:39 조회 : 307

In this note, I will give you a reason why we should not use the NOT IN clause in a subquery.
When I have to pvovide developers with guidance on writing queries, I usually say,
"While writing a query, you can suppose that NOT EXISTS will provide better results since it can use all the logic and optimization for joining two tables whereas NOT IN operator will result in using SubPlans. SubPlan is like filtering operation in Oracle and it will create a bottleneck when we deal with large datasets. The NOT EXISTS clause, in its turn, will lead to anti joins without any Subplans."

Last year when I gave this guidance to a project team, a curious developer performed an experiment and provided the test result, saying "You may be wrong. The NOT IN clause is as performant as the NOT EXISTS clause."

Here is the simplified demonstration script (shown below with some cosmetic changes) the suspicious developer offered.


create table ORDERS (ord_no bigint,cust_id varchar(20),comment varchar(100),ord_date varchar(8));

ALTER TABLE ORDERS ADD CONSTRAINT PK_ORDERS PRIMARY KEY(ORD_NO);

CREATE TABLE ORDERS_DETAIL(ORD_LINE_NO BIGINT NOT NULL,ORD_NO BIGINT NOT NULL,PROD_ID VARCHAR(10) NOT NULL,COMMENT VARCHAR(100),ORD_AMT BIGINT);

ALTER TABLE ORDERS_DETAIL ADD CONSTRAINT PK_ORDERS_DETAIL PRIMARY KEY(ORD_LINE_NO);
insert into ORDERS
select i as ord_no    
        , 'C'||mod(i,10) as cust_id    
        , lpad('X',10,'Y') as comment     
        , to_char(to_date('20191001','YYYYMMDD')+mod(i,60),'yyyymmdd') as ord_date
  from generate_series(1,1000000) a(i);

INSERT INTO ORDERS_DETAIL
SELECT i as ORD_LINE_NO   
          , mod(i,1000000) AS ORD_NO   
          , 'PP'||MOD(i,5) AS PROD_ID   
          , lpad('X',10,'Y') as comment
      , case when i < 1000 then i*100 else i end as prod_amt
  FROM generate_series(1,10000000) a(i); 


show work_mem;   --Note that the work_mem is set to 64 Mbytes.

64m


--I have disabled parallel processing for the purpose of simplicity.

SET max_parallel_workers_per_gather=0;   


Here is the query in question. I am using the ORDERS_DETAIL and ORDERS tables to demonstrate a point about subquery executioin.


SELECT A.ORD_LINE_NO, A.ORD_AMT  
  FROM ORDERS_DETAIL A
WHERE A.ORD_NO NOT IN (SELECT B.ORD_NO
                                             FROM ORDERS B
                                          WHERE ORD_DATE > '20181220'); 


Here is the plan I got when I ran the query:


Seq Scan on orders_detail a (actual time=786.107..2948.458 rows=10 loops=1)
  Filter: (NOT (hashed SubPlan 1))
  Rows Removed by Filter: 9999990
  Buffers: shared hit=7609 read=83078
  SubPlan 1
    ->  Seq Scan on orders b (actual time=0.037..118.264 rows=1000000 loops=1)
          Filter: ((ord_date)::text > '20181220'::text)
          Buffers: shared hit=7353
Planning Time: 0.077 ms
Execution Time: 2949.921 ms 


The plan was a little surprising and not what I should have expected.

Note that the planner decided to use hashed Subplan, not plain Subplan. Hashed Subplan builds a hash table over the ORD_NO's from the table ORDERS and searches the ORD_NOs from the table ORDERS_DETAIL in this hash table.

Unlike what I explained in the guidance, we can't observe any inefficiency in the execution plan.


The following is the query using the NOT EXISTS clause followed by its execution plan.


SELECT A.ORD_LINE_NO, A.ORD_AMT
   FROM ORDERS_DETAIL A
WHERE NOT EXISTS (SELECT 1
                                    FROM ORDERS B
                                 WHERE A.ORD_NO = B.ORD_NO
                                                AND ORD_DATE > '20181220');

Hash Anti Join (actual time=641.435..3577.345 rows=10 loops=1)
  Hash Cond: (a.ord_no = b.ord_no)
  Buffers: shared hit=7673 read=83014
  ->  Seq Scan on orders_detail a (actual time=0.054..715.294 rows=10000000 loops=1)
        Buffers: shared hit=320 read=83014
  ->  Hash (actual time=315.512..315.513 rows=1000000 loops=1)
        Buckets: 1048576  Batches: 1  Memory Usage: 47255kB
        Buffers: shared hit=7353
        ->  Seq Scan on orders b (actual time=0.015..121.175 rows=1000000 loops=1)
              Filter: ((ord_date)::text > '20181220'::text)
              Buffers: shared hit=7353
Planning:
  Buffers: shared hit=8
Planning Time: 0.190 ms
Execution Time: 3578.991 ms

When we used the NOT EXISTS clause, the optimizer decided to use hash anti join as expected. As you can see, the query with the NOT IN clause is a little faster than the query with the NOT EXISTS clause.(2949 ms -> 3578 ms)


I was somewhat bewildered. So I decided to dive into the details of NOT EXISTS and NOT IN. And the following are my findings. 


Sometimes hashed Subplan is faster than hash join and than all the other options, as it preserves the order.

Hashed Subplan will not be used if it would need more than one batch which means that when the parameter value of work_mem is small, we would get plain Subplan. Batching a hash join requires freedom to reorder the rows on both sides of the join. A SupPlan, by definition, must deliver the correct answer for the current outer row on-demand.


The drawback of the first query with the NOT IN clause is that a small decrease in work_mem or a small increase in the data size may lead to a dramatic change in a query elapsed time. By contrast, anti hash join handles multiple batches better. In the above example, only one batch was used. So when the NOT IN clause was in the query, the planner was able to use hashed Subplan.


Now let us reduce the size of work_mem and check what happens to the exeuction plan.


set work_mem=3200;


SELECT A.ORD_LINE_NO, A.ORD_AMT  
  FROM ORDERS_DETAIL A
WHERE A.ORD_NO NOT IN (SELECT B.ORD_NO
                                             FROM ORDERS B
                                          WHERE ORD_DATE > '20181220');


Gather  (cost=1000.00..65125636417.33 rows=5000000 width=16)
  Workers Planned: 2
  ->  Parallel Seq Scan on orders_detail a  (cost=0.00..65125135417.33 rows=2083333 width=16)
        Filter: (NOT (SubPlan 1))
        SubPlan 1
          ->  Materialize  (cost=0.00..28760.00 rows=1000000 width=8)
                ->  Seq Scan on orders b  (cost=0.00..19853.00 rows=1000000 width=8)
                      Filter: ((ord_date)::text > '20181220'::text)


When I ran the query above, I couldn't get the result, it just hung. So I couldn't get actual beffer I/Os and elapsed time in the execution plan.

The subquery is evaluated once for each row selected by the main query. So we can infer that the subquery will be evaluated 5000000 times which is why I couldn't see the output when I ran the above query. In Oracle the filter subquery can be efficient when the number of distinct input values from the main query is small because Oracle has "scalar subquery caching" mechanism. But PostgreSQL does not have such mechanism and it cannot minimise the number of executions of the subquery.


Here is the execution plan I got when I used the NOT EXISTS clause.


SELECT A.ORD_LINE_NO, A.ORD_AMT
   FROM ORDERS_DETAIL A
WHERE NOT EXISTS (SELECT 1
                                    FROM ORDERS B
                                 WHERE A.ORD_NO = B.ORD_NO
                                                AND ORD_DATE > '20181220');


Hash Anti Join (actual time=4124.142..4162.771 rows=10 loops=1)
  Hash Cond: (a.ord_no = b.ord_no)
  Buffers: shared hit=8793 read=81894, temp read=55361 written=55361
  ->  Seq Scan on orders_detail a (actual time=0.017..1108.315 rows=10000000 loops=1)
        Buffers: shared hit=1440 read=81894
  ->  Hash (actual time=276.168..276.169 rows=1000000 loops=1)
        Buckets: 131072  Batches: 32  Memory Usage: 2253kB
        Buffers: shared hit=7353, temp written=3292
        ->  Seq Scan on orders b (actual time=0.030..135.584 rows=1000000 loops=1)
              Filter: ((ord_date)::text > '20181220'::text)
              Buffers: shared hit=7353
Planning:
  Buffers: shared hit=8
Planning Time: 0.421 ms
Execution Time: 4163.090 ms


Because my work_mem setting is small, the optimizer used multiple batches (batch:32) resuling in disk IOs(temp_writtten=32). And the elapsed time rose from 3578.991 ms to 4163.090 ms. Anyhow, we could get the output, whereas we nearly failed to get the output when we used the NOT IN clause in a subquery.

I hope PostgreSQL will be able to transform the query with the NOT IN clause efficiently in the future version.


Conclusion

If the dataset of the subquery is small, the query with the NOT IN clause can perform well with hashed Subplan. But as the dataset overgrows a certain limit, PostgreSQL will resort to using Subplan. And one day the query will become extremely slow without any apparent reason.  In practice, it is better to be safe than sorry, so do not use NOT IN. Use Not Exists. That would be a belt and braces approach.


Addendum

There is one more reason why we should not use NOT IN. There is a risk of Out Of Memroy with a hashed subplan.

Let's assume that the statistics on the ORDERS table is obslolete and the planner underestimates the size of the table. In that case the planner may choose hashed SubPlan even though the work_mem setting is small. I have a vague memory that PostgreSQL cannot flush a hashed subplan onto the disk so it increases work_mem. If the actual dataset from the ORDERS table is much bigger, the increased work_mem may incur Out Of Memory. In the past HashAggregate had the same problem of risking Out of Memory as well. But since PostgreSQL 13 it is possible to make a hash aggregate spill data to disk  if more than work_mem is used up for a hash aggregate.


   

postgresdba.com