Removing duplicate data from a table is a rare event, but every now and then you may need to do a bit of spring cleaning. In a robust data model, every table has a primary key that prevents the same data from being inserted twice. When a table is designed badly, however, the primary key cannot stop a duplicate row from being created.
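As a minimal sketch (the table and column names are made up for illustration), a primary key declared on the business key rejects the second insert outright:

create table product_history (
    history_at   timestamp not null,
    product_no   integer   not null,
    product_name text,
    primary key (history_at, product_no)
);
insert into product_history values ('2024-01-01 09:00:00', 1, 'foo');
insert into product_history values ('2024-01-01 09:00:00', 1, 'foo');
-- ERROR:  duplicate key value violates unique constraint "product_history_pkey"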
Here is an example of a bad data model I was recently confronted with. If you already know data modeling theory like the back of your hand, feel free to skip ahead to the next paragraph.

In the 상품이력 (product history) table, rows with the same (이력일시, 상품번호, 상품명, 판매시작일자, 판매금액) values can be inserted. The business key consisting of the (이력일시, 상품번호) columns should have been nominated as the identifier. The artificial identifier (상품이력일련번호) says nothing about the actual properties of the data; in particular, it cannot establish the uniqueness of a row. The only uniqueness an artificial identifier can guarantee is its own. That is why duplicates can occur in the 상품이력 table, which is the very thing normalization tries to avoid. Artificial identifiers are appropriate when there are many child entities associated with the parent entity. The question of which identifier to use, the business key or the surrogate key, is complicated; I am going to dive into some of the complexity and inherent trade-offs in another note.
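Note that even when the surrogate key stays for the sake of the child entities, the duplicates themselves are preventable; a sketch, assuming the 상품이력 table exists with the columns shown in the ERD (the constraint name is my own):

alter table 상품이력
    add constraint 상품이력_uk unique (이력일시, 상품번호);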
When duplicate rows are created, it is not easy to detect and delete them.
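Before hunting for delete techniques, a straightforward way to detect them is to group by the business key and keep only the groups with more than one row; a sketch against the 상품이력 table from the ERD above:

select 이력일시, 상품번호, count(*) as cnt
  from 상품이력
 group by 이력일시, 상품번호
having count(*) > 1;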
After googling for some ways of deleting duplicate rows, I found this recent blog post:
https://sqlfordevs.io/delete-duplicate-rows
The author introduces a pretty elegant way to erase duplicate rows using a window function. The example query uses a CTE to identify all of the duplicates: every distinct row gets its own row number, which shows how many duplicates there are.
In this note we will start with a quick recap of how to delete duplicate tuples as introduced in that blog post, and then look at an alternative solution you can use in PostgreSQL to wipe out duplicate rows.
The following test table T1 has been created from the system catalog pg_class:
create table t1 as
select oid, relname, relnamespace
  from pg_class;
Here is what the data in the table T1 looks like.
select * from t1 limit 2;
+-------+--------------+--------------+
|  oid  |   relname    | relnamespace |
+-------+--------------+--------------+
| 16660 | emp          |        58552 |
|  2619 | pg_statistic |           11 |
+-------+--------------+--------------+
I have inserted a row to create a duplicate in the table T1.
insert into t1 select 999,'pg_statistic',11;
select * from t1 where relname='pg_statistic';
+------+--------------+--------------+
| oid  |   relname    | relnamespace |
+------+--------------+--------------+
| 2619 | pg_statistic |           11 |
|  999 | pg_statistic |           11 |
+------+--------------+--------------+
Note that the two rows have the same (relname, relnamespace) values. Let us say that (relname, relnamespace) is the business key and you have to remove the duplicates. Here is the SQL statement Tobias Petry presented, followed by its execution plan:
with w as (
  select oid,
         row_number() over (partition by relname, relnamespace
                            order by oid) as rownum
    from t1
)
delete
  from t1
 using w
 where w.oid = t1.oid
   and w.rownum > 1;
Delete on scott.t1 (actual time=1.040..1.042 rows=0 loops=1)
  Buffers: shared hit=25
  ->  Hash Join (actual time=0.936..1.013 rows=1 loops=1)
        Output: t1.ctid, w.*
        Hash Cond: (t1.oid = w.oid)
        Buffers: shared hit=24
        ->  Seq Scan on scott.t1 (actual time=0.015..0.062 rows=661 loops=1)
              Output: t1.ctid, t1.oid
              Buffers: shared hit=9
        ->  Hash (actual time=0.908..0.910 rows=1 loops=1)
              Output: w.*, w.oid
              Buckets: 1024  Batches: 1  Memory Usage: 9kB
              Buffers: shared hit=15
              ->  Subquery Scan on w (actual time=0.766..0.904 rows=1 loops=1)
                    Output: w.*, w.oid
                    Filter: (w.rownum > 1)
                    Rows Removed by Filter: 660
                    Buffers: shared hit=15
                    ->  WindowAgg (actual time=0.534..0.867 rows=661 loops=1)
                          Output: t1_1.oid, row_number() OVER (?), t1_1.relname, t1_1.relnamespace
                          Buffers: shared hit=15
                          ->  Sort (actual time=0.219..0.245 rows=661 loops=1)
                                Output: t1_1.oid, t1_1.relname, t1_1.relnamespace
                                Sort Key: t1_1.relname, t1_1.relnamespace, t1_1.oid
                                Sort Method: quicksort  Memory: 117kB
                                Buffers: shared hit=15
                                ->  Seq Scan on scott.t1 t1_1 (actual time=0.004..0.039 rows=661 loops=1)
                                      Output: t1_1.oid, t1_1.relname, t1_1.relnamespace
                                      Buffers: shared hit=9
Query Identifier: 7539593811218058663
Planning:
  Buffers: shared hit=28
Planning Time: 0.699 ms
Execution Time: 2.536 ms
In the above SQL statement the rows are partitioned by the columns that indicate a duplicate. Within each partition the rows are sorted by one or more columns that decide which rows survive; in the sample query each partition is sorted by the oid column in ascending order. The ROW_NUMBER window function then assigns an incrementing number to the sorted rows within each partition, and rows can be deleted according to the desired number of remaining rows. When, for example, only one row should be kept, every row with a row number greater than one is deleted.
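If you want to inspect the numbering before deleting anything, the CTE can be run on its own as a plain SELECT:

select oid, relname, relnamespace,
       row_number() over (partition by relname, relnamespace
                          order by oid) as rownum
  from t1
 where relname = 'pg_statistic';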
You can check whether a duplicate row was deleted by running the following query:
select * from t1 where relname='pg_statistic';
+-----+--------------+--------------+
| oid |   relname    | relnamespace |
+-----+--------------+--------------+
| 999 | pg_statistic |           11 |
+-----+--------------+--------------+
Likewise, if you want to keep the row with the highest oid, use an order by oid desc clause instead of order by oid.
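For instance, the variant that keeps the row with the highest oid looks like this; only the sort direction inside the window changes:

with w as (
  select oid,
         row_number() over (partition by relname, relnamespace
                            order by oid desc) as rownum
    from t1
)
delete
  from t1
 using w
 where w.oid = t1.oid
   and w.rownum > 1;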
But this method presents a problem: the ROW_NUMBER() window function takes a lot of resources, since it has to sort the whole table. So I tried another solution to speed up the process: a DELETE statement with a correlated subquery that removes every row whose oid is higher than the minimum oid of its group.
I have made a duplicate row again.
insert into t1 select 2619,'pg_statistic',11;
delete
  from t1 a
 where oid > (select min(b.oid)
                from t1 b
               where a.relname = b.relname
                 and a.relnamespace = b.relnamespace);
Delete on scott.t1 a (actual time=23.428..23.429 rows=0 loops=1)
  Buffers: shared hit=5959
  ->  Seq Scan on scott.t1 a (actual time=23.415..23.416 rows=1 loops=1)
        Output: a.ctid
        Filter: (a.oid > (SubPlan 1))
        Rows Removed by Filter: 660
        Buffers: shared hit=5958
        SubPlan 1
          ->  Aggregate (actual time=0.035..0.035 rows=1 loops=661)
                Output: min(b.oid)
                Buffers: shared hit=5949
                ->  Seq Scan on scott.t1 b (actual time=0.016..0.032 rows=1 loops=661)
                      Output: b.oid, b.relname, b.relnamespace
                      Filter: ((a.relname = b.relname) AND (a.relnamespace = b.relnamespace))
                      Rows Removed by Filter: 660
                      Buffers: shared hit=5949
Query Identifier: -3673588318904078456
Planning Time: 0.100 ms
Execution Time: 23.471 ms
Even though the query returns the expected result, the execution plan is extremely suboptimal: PostgreSQL had to run the subquery 661 times (loops=661).
To be honest, having been an Oracle guy for a long time, I had been using the above query, and Oracle produces an efficient plan for it. If you run the query in Oracle, the optimizer produces the following plan:
-----------------------------------------------------
| Id  | Operation               | Name    | Starts |
-----------------------------------------------------
|   0 | DELETE STATEMENT        |         |      1 |
|   1 |  DELETE                 | T1      |      1 |
|*  2 |   HASH JOIN             |         |      1 |
|   3 |    TABLE ACCESS FULL    | T1      |      1 |
|   4 |    VIEW                 | VW_SQ_1 |      1 |
|   5 |     SORT GROUP BY       |         |      1 |
|   6 |      TABLE ACCESS FULL  | T1      |      1 |
-----------------------------------------------------
Unlike PostgreSQL, which accesses the T1 table 662 times (1+661), Oracle accesses it just twice (Id=3, Id=6): the optimizer unnests the correlated subquery into an in-line view (VW_SQ_1) and joins it to T1. So I rewrote the query in PostgreSQL as an explicit self-join. First, I re-ran the insert to make a duplicate row again:
insert into t1 select 2619,'pg_statistic',11;
delete
  from t1 a
 using t1 b
 where a.relname = b.relname
   and a.relnamespace = b.relnamespace
   and a.oid > b.oid;
Delete on scott.t1 a (actual time=0.342..0.343 rows=0 loops=1)
  Buffers: shared hit=19
  ->  Hash Join (actual time=0.330..0.331 rows=1 loops=1)
        Output: a.ctid, b.ctid
        Hash Cond: ((a.relname = b.relname) AND (a.relnamespace = b.relnamespace))
        Join Filter: (a.oid > b.oid)
        Rows Removed by Join Filter: 662
        Buffers: shared hit=18
        ->  Seq Scan on scott.t1 a (actual time=0.017..0.058 rows=661 loops=1)
              Output: a.ctid, a.relname, a.relnamespace, a.oid
              Buffers: shared hit=9
        ->  Hash (actual time=0.158..0.159 rows=661 loops=1)
              Output: b.ctid, b.relname, b.relnamespace, b.oid
              Buckets: 1024  Batches: 1  Memory Usage: 81kB
              Buffers: shared hit=9
              ->  Seq Scan on scott.t1 b (actual time=0.004..0.066 rows=661 loops=1)
                    Output: b.ctid, b.relname, b.relnamespace, b.oid
                    Buffers: shared hit=9
Query Identifier: -3693632511278988019
Planning Time: 0.127 ms
Execution Time: 0.362 ms
As you can see, compared to the query using the window function, there is not really a lot of difference in the number of block IOs between the two: 25 and 19 respectively. But an important difference shows up when you look at the run time: 2.536 ms versus 0.362 ms. And compared to the query using a correlated subquery, it is safe to say that you have to find a way to eliminate correlated subqueries from a DELETE statement; PostgreSQL is very poor at dealing with them. Looking at the execution plan closely, we can also predict that with the same amount of data this PostgreSQL plan would be faster than Oracle's, because PostgreSQL does not have to perform the aggregation (group by) step that appears in the Oracle plan (Id=5).
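One last point: once the duplicates are gone, it is worth making sure they cannot come back. A sketch, assuming (relname, relnamespace) really is the business key of T1 (the constraint name is my own):

alter table t1
    add constraint t1_uk unique (relname, relnamespace);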
Now you should know how to remove duplicate records in PostgreSQL.
Addendum
Poor design leads to using a lot of storage. With regard to the sample ERD at the beginning of this note, I cannot see any reason why an artificial identifier is there. Disk space is cheap but not free, and the space a database uses comes with some serious multipliers like the following:
- Backups
- Replicas
- Development/Staging/Production Environment
- WAL files
- Indexes
- Shared buffer cache in memory
Poor design costs performance and money!