How to remove duplicate rows in PostgreSQL

Author : 모델광   Date : 2022-09-18 (Sun) 20:12   Views : 45

Removing duplicate rows from a table is a rare event. But every now and then you may need to do a bit of spring cleaning. If you have a robust data model, every table has a primary key that prevents the same data from being inserted twice. But when a table is designed badly, the primary key cannot stop a duplicate row from being created.

Here is an example of a bad data model I was recently confronted with. If you already know data modeling theory like the back of your hand, feel free to skip to the next paragraph.

In the 상품이력 table, rows with the same (이력일시, 상품번호, 상품명, 판매시작일자, 판매금액) values can be inserted. The business key consisting of the (이력일시, 상품번호) columns should have been nominated as the identifier. The artificial identifier (상품이력일련번호) says nothing about the actual properties of the data; in particular, it cannot identify a row's uniqueness. The only uniqueness an artificial identifier can guarantee is its own. That is why duplicates can occur in the 상품이력 table - the very thing normalization tries to avoid. Artificial identifiers are appropriate when many child entities are associated with the parent entity. The question of which identifier (the business key or the surrogate key) to use is complicated - I will dive into some of that complexity and its inherent trade-offs in another note.
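
To illustrate the point, here is a sketch of the table with both keys (the column types are my assumption, not taken from the original ERD): even if you keep the surrogate key, a unique constraint on the business key would have blocked the duplicates at insert time.

create table 상품이력 (
    상품이력일련번호  bigint generated always as identity primary key,
    이력일시          timestamp   not null,
    상품번호          varchar(10) not null,
    상품명            varchar(100),
    판매시작일자      date,
    판매금액          numeric,
    constraint uk_상품이력 unique (이력일시, 상품번호)  -- business key enforced
);

With this constraint in place, a second insert with the same (이력일시, 상품번호) values fails with a unique-violation error instead of silently creating a duplicate.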


When duplicate rows are created, it is not easy to detect and delete them. After googling for some ways of deleting duplicate rows, I found this recent blog post:
 

https://sqlfordevs.io/delete-duplicate-rows


The poster introduced a pretty elegant way to erase duplicate rows using a window function. In the example query he has a CTE that identifies all of the duplicates; each row within a group of duplicates gets its own row number, which shows how many duplicates there are.


In this note we'll start with a quick recap of how to delete duplicate tuples as introduced in the blog. Then we'll look at an alternative solution you can use in a PostgreSQL database to wipe out duplicate rows.


The following test table T1 has been created from the system catalog pg_class:


create table t1
as
select oid, relname, relnamespace
  from pg_class;


Here is what the data in the table T1 looks like.


select * from t1 limit 2;

+-------+--------------+--------------+
| oid   | relname      | relnamespace |
+-------+--------------+--------------+
| 16660 | emp          | 58552        |
| 2619  | pg_statistic | 11           |
+-------+--------------+--------------+


I have inserted a row to make a duplicate row in the table T1.


insert into t1 select 999,'pg_statistic',11;


select * from t1 where relname='pg_statistic';

+------+--------------+--------------+
| oid  | relname      | relnamespace |
+------+--------------+--------------+
| 2619 | pg_statistic | 11           |
| 999  | pg_statistic | 11           |
+------+--------------+--------------+


Note that the two rows have the same (relname, relnamespace) values. Let us say that (relname, relnamespace) is the business key and you have to remove duplicates. Here is the SQL statement Tobias Petry presented, followed by its execution plan:


with w as (
select oid, row_number() over (partition by relname, relnamespace
                               order by oid) as rownum
  from t1
)
delete
  from t1
 using w
 where w.oid = t1.oid
   and w.rownum > 1;


Delete on scott.t1 (actual time=1.040..1.042 rows=0 loops=1)
  Buffers: shared hit=25
  ->  Hash Join (actual time=0.936..1.013 rows=1 loops=1)
        Output: t1.ctid, w.*
        Hash Cond: (t1.oid = w.oid)
        Buffers: shared hit=24
        ->  Seq Scan on scott.t1 (actual time=0.015..0.062 rows=661 loops=1)
              Output: t1.ctid, t1.oid
              Buffers: shared hit=9
        ->  Hash (actual time=0.908..0.910 rows=1 loops=1)
              Output: w.*, w.oid
              Buckets: 1024  Batches: 1  Memory Usage: 9kB
              Buffers: shared hit=15
              ->  Subquery Scan on w (actual time=0.766..0.904 rows=1 loops=1)
                    Output: w.*, w.oid
                    Filter: (w.rownum > 1)
                    Rows Removed by Filter: 660
                    Buffers: shared hit=15
                    ->  WindowAgg (actual time=0.534..0.867 rows=661 loops=1)
                          Output: t1_1.oid, row_number() OVER (?), t1_1.relname, t1_1.relnamespace
                          Buffers: shared hit=15
                          ->  Sort (actual time=0.219..0.245 rows=661 loops=1)
                                Output: t1_1.oid, t1_1.relname, t1_1.relnamespace
                                Sort Key: t1_1.relname, t1_1.relnamespace, t1_1.oid
                                Sort Method: quicksort  Memory: 117kB
                                Buffers: shared hit=15
                                ->  Seq Scan on scott.t1 t1_1 (actual time=0.004..0.039 rows=661 loops=1)
                                      Output: t1_1.oid, t1_1.relname, t1_1.relnamespace
                                      Buffers: shared hit=9
Query Identifier: 7539593811218058663
Planning:
  Buffers: shared hit=28
Planning Time: 0.699 ms
Execution Time: 2.536 ms


In the above SQL statement the rows are partitioned by the columns that identify a duplicate. Every partition is sorted by one or more columns that decide which rows to keep. In the sample query each partition is sorted by the oid column in ascending order. The sorted rows within a partition are assigned an incrementing number by the ROW_NUMBER window function. Rows can then be deleted according to the desired number of remaining rows: when, for example, only one row should be kept, any row with a row number greater than one is deleted.


You can check whether a duplicate row was deleted by running the following query:


select * from t1 where relname='pg_statistic';

+------+--------------+--------------+
| oid  | relname      | relnamespace |
+------+--------------+--------------+
| 999  | pg_statistic | 11           |
+------+--------------+--------------+


Likewise, if you want to keep the row with the highest oid, use the order by oid desc clause instead of the order by oid clause.
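
That variant is the same statement with only the sort direction flipped - a sketch, reusing the query shown earlier:

with w as (
select oid, row_number() over (partition by relname, relnamespace
                               order by oid desc) as rownum  -- highest oid gets rownum 1
  from t1
)
delete
  from t1
 using w
 where w.oid = t1.oid
   and w.rownum > 1;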

But this method presents a problem. The ROW_NUMBER() window function has to sort all the rows in the table, which takes a lot of resources. So I propose another solution to speed up the process.


You can use the following DELETE statement with a correlated subquery to remove every row whose oid is not the lowest one in its group.


I have made a duplicate row again.


insert into t1 select 2619,'pg_statistic',11;


delete
  from t1 a
 where oid > (select min(b.oid)
                from t1 b
               where 1=1
                 and a.relname = b.relname
                 and a.relnamespace = b.relnamespace);


Delete on scott.t1 a (actual time=23.428..23.429 rows=0 loops=1)
  Buffers: shared hit=5959
  ->  Seq Scan on scott.t1 a (actual time=23.415..23.416 rows=1 loops=1)
        Output: a.ctid
        Filter: (a.oid > (SubPlan 1))
        Rows Removed by Filter: 660
        Buffers: shared hit=5958
        SubPlan 1
          ->  Aggregate (actual time=0.035..0.035 rows=1 loops=661)
                Output: min(b.oid)
                Buffers: shared hit=5949
                ->  Seq Scan on scott.t1 b (actual time=0.016..0.032 rows=1 loops=661)
                      Output: b.oid, b.relname, b.relnamespace
                      Filter: ((a.relname = b.relname) AND (a.relnamespace = b.relnamespace))
                      Rows Removed by Filter: 660
                      Buffers: shared hit=5949
Query Identifier: -3673588318904078456
Planning Time: 0.100 ms
Execution Time: 23.471 ms


Even though the query returns the expected result, the execution plan is extremely suboptimal: PostgreSQL had to run the subquery 661 times (loops=661). To be honest, as a long-time Oracle guy I have used the above query, and Oracle produces an efficient plan for it. If you run the query in Oracle, the optimizer produces the following plan:

----------------------------------------------------
| Id  | Operation              | Name    | Starts |
----------------------------------------------------
|   0 | DELETE STATEMENT       |         |      1 |
|   1 |  DELETE                | T1      |      1 |
|*  2 |   HASH JOIN            |         |      1 |
|   3 |    TABLE ACCESS FULL   | T1      |      1 |
|   4 |    VIEW                | VW_SQ_1 |      1 |
|   5 |     SORT GROUP BY      |         |      1 |
|   6 |      TABLE ACCESS FULL | T1      |      1 |
----------------------------------------------------

Unlike PostgreSQL, which accesses the T1 table 662 times (1+661), Oracle accesses the T1 table just two times (Id=3, Id=6).


So I rewrote the query in PostgreSQL like the following:

I have re-run the following query to make a duplicate row.


insert into t1 select 2619,'pg_statistic',11;



delete
  from t1 a
 using t1 b
 where a.relname = b.relname
   and a.relnamespace = b.relnamespace
   and a.oid > b.oid;


Delete on scott.t1 a (actual time=0.342..0.343 rows=0 loops=1)
  Buffers: shared hit=19
  ->  Hash Join (actual time=0.330..0.331 rows=1 loops=1)
        Output: a.ctid, b.ctid
        Hash Cond: ((a.relname = b.relname) AND (a.relnamespace = b.relnamespace))
        Join Filter: (a.oid > b.oid)
        Rows Removed by Join Filter: 662
        Buffers: shared hit=18
        ->  Seq Scan on scott.t1 a (actual time=0.017..0.058 rows=661 loops=1)
              Output: a.ctid, a.relname, a.relnamespace, a.oid
              Buffers: shared hit=9
        ->  Hash (actual time=0.158..0.159 rows=661 loops=1)
              Output: b.ctid, b.relname, b.relnamespace, b.oid
              Buckets: 1024  Batches: 1  Memory Usage: 81kB
              Buffers: shared hit=9
              ->  Seq Scan on scott.t1 b (actual time=0.004..0.066 rows=661 loops=1)
                    Output: b.ctid, b.relname, b.relnamespace, b.oid
                    Buffers: shared hit=9
Query Identifier: -3693632511278988019
Planning Time: 0.127 ms
Execution Time: 0.362 ms


As you can see, compared to the query using the window function, there is not really a lot of difference in the number of block IOs between the two - 25 and 19 respectively. But an important difference shows up when you look at the run time - 2.536 ms and 0.362 ms respectively. And compared to the query using a correlated subquery, it is safe to say that you have to find a way to remove correlated subqueries from a DELETE statement; PostgreSQL is very poor at dealing with them. Looking at the execution plan closely, we can predict that with the same amount of data PostgreSQL would be faster than Oracle, because PostgreSQL does not have to do the aggregation (group by) operation.
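
One caveat worth noting: the self-join trick needs a column such as oid that differs between the duplicates. When the duplicate rows are completely identical in every column, you can fall back on the system column ctid, which is unique per row version. This is a sketch against the same T1 table; it relies on the tid comparison operators, which I believe require a recent PostgreSQL version (14 or later):

delete
  from t1 a
 using t1 b
 where a.relname = b.relname
   and a.relnamespace = b.relnamespace
   and a.ctid > b.ctid;  -- keep the row with the lowest physical location

Bear in mind that ctid changes after updates and VACUUM FULL, so it should only ever be used transiently within a single statement like this, never stored as a row identifier.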


Now you should know how to remove duplicate records in PostgreSQL.


Addendum

Poor design leads to using a lot of storage. With regard to the sample ERD at the beginning of this note, I cannot see any reason why an artificial identifier is there. Disk space is cheap but not free, and database space used has some serious multipliers, such as:

- Backups

- Replicas

- Development/Staging/Production environments

- WAL files

- Indexes

- Shared buffer cache in memory

Poor design costs performance and money!

   

postgresdba.com