未加星标

Find and Remove Duplicate Rows from a SQL Server Table

字体大小 | |
[数据库(mssql) 所属分类 数据库(mssql) | 发布者 店小二03 | 时间 2016 | 作者 红领巾 ] 0人收藏点击收藏

By:Sergey Gigoyan || Related Tips:More >Database Design

Problem

According to database design best practices , a SQL Server table should not contain duplicate rows. During the database design processprimary keys should be created to eliminate duplicate rows. However, sometimes we need to work with databases where these rules are not followed or exceptions are possible (when these rules are bypassed knowingly). For example, when a staging table is used and data is loaded from different sources where duplicate rows are possible. When the loading process completes, table should be cleaned or clean data should be loaded to a permanent table, so after that duplicates are no longer needed. Therefore an issue concerning the removal of duplicates from the loading table arises. In this tip let's examine some ways to solve data de-duplication needs.

Solution

We will consider two cases in this tip:

The first case is when a SQL Server table has a primary key (or unique index) and one of the columns contains duplicate values which should be removed. The second case is that table does not have a primary key or any unique indexes and contains duplicate rows which should be removed. Let's discuss these cases separately. Removing duplicates rows from a SQL Server table with a unique index Test Environment Setup

To accomplish our tasks we need a test environment:

USE master
GO
CREATE DATABASE TestDB
GO
USE TestDB
GO
CREATE TABLE TableA
(
ID INT NOT NULL IDENTITY(1,1),
Value INT,
CONSTRAINT PK_ID PRIMARY KEY(ID)
)

Now let's insert data into 'TableA':

USE TestDB
GO
INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)
SELECT *
FROM TableA
SELECT Value, COUNT(*) AS DuplicatesCount
FROM TableA
GROUP BY Value

As we can see the values 3 and 5 exists in the 'Value' column more than once:


Find and Remove Duplicate Rows from a SQL Server Table
Identify Duplicate Rows in the Table

Our task is to enforce uniqueness for the 'Value' column by removing duplicates. Removing duplicate values from table with a unique index is a bit easier than removing the rows from a table without it. First of all, we need to find duplicates. There are many different ways to do that. Let's investigate and compare some common ways. In the code below there are six solutions to find that duplicate values which should be deleted (leaving only one value):

----- Finding duplicate values in a table with a unique index
--Solution 1
SELECT a.*
FROM TableA a, (SELECT ID, (SELECT MAX(Value) FROM TableA i WHERE o.Value=i.Value GROUP BY Value HAVING o.ID < MAX(i.ID)) AS MaxValue FROM TableA o) b
WHERE a.ID=b.ID AND b.MaxValue IS NOT NULL
--Solution 2
SELECT a.*
FROM TableA a, (SELECT ID, (SELECT MAX(Value) FROM TableA i WHERE o.Value=i.Value GROUP BY Value HAVING o.ID=MAX(i.ID)) AS MaxValue FROM TableA o) b
WHERE a.ID=b.ID AND b.MaxValue IS NULL
--Solution 3
SELECT a.*
FROM
TableA a
INNER JOIN
(
SELECT MAX(ID) AS ID, Value
FROM TableA
GROUP BY Value
HAVING COUNT(Value) > 1
) b
ON a.ID < b.ID AND a.Value=b.Value
--Solution 4
SELECT a.*
FROM TableA a
WHERE ID < (SELECT MAX(ID) FROM TableA b WHERE a.Value=b.Value GROUP BY Value HAVING COUNT(*) > 1)
--Solution 5
SELECT a.*
FROM TableA a
INNER JOIN
(SELECT ID, RANK() OVER(PARTITION BY Value ORDER BY ID DESC) AS rnk FROM TableA ) b
ON a.ID=b.ID
WHERE b.rnk > 1
--Solution 6
SELECT * FROM TableA
WHERE ID NOT IN (SELECT MAX(ID) FROM TableA
GROUP BY Value)

As we can see the result for all cases is the same:


Find and Remove Duplicate Rows from a SQL Server Table

Only rows with ID=3, 5, 6 need to be deleted. Looking at the execution plan we can see that latest - the most 'compact' solution ('Solution 6') has a highest cost (in our example there is a primary key on the 'ID' column, so 'NULL' values are not possible for that column, therefore 'NOT IN' will work without any problem), and the second has the lowest cost:


Find and Remove Duplicate Rows from a SQL Server Table
Deleting Duplicate Rows

Now by using these queries, let's delete duplicate values from the table. To simplify our process we will use only the second, the fifth and the sixth queries:

USE TestDB
GO
--Initializing the table
TRUNCATE TABLE TableA
INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)
--Deleting duplicate values
DELETE t
FROM TableA t
WHERE ID IN ( SELECT a.ID FROM TableA a, (SELECT ID, (SELECT MAX(Value) FROM TableA i WHERE o.Value=i.Value GROUP BY Value HAVING o.ID=MAX(i.ID)) AS MaxValue FROM TableA o) b
WHERE a.ID=b.ID AND b.MaxValue IS NULL)
--Initializing the table
TRUNCATE TABLE TableA
INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)
--Deleting duplicate values
DELETE a
FROM TableA a
INNER JOIN
(SELECT ID, RANK() OVER(PARTITION BY Value ORDER BY ID DESC) AS rnk FROM TableA ) b
ON a.ID=b.ID
WHERE b.rnk>1
--Initializing the table
TRUNCATE TABLE TableA
INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)
--Deleting duplicate values
DELETE FROM TableA
WHERE ID NOT IN (SELECT MAX(ID) FROM TableA
GROUP BY Value)

Deleting the data and looking into the execution plans again we see that the fastest is the first DELETE command and the slowest is the last as expected:


Find and Remove Duplicate Rows from a SQL Server Table
Removing duplicates from table without unique index in ORACLE

As a means to help illustrate our final example in this tip, I want to explain some similar functionality in Oracle. Removing duplicate rows from the table without a unique index is a little easier in Oracle than in SQL Server. There is a ROWID pseudo column in Oracle which returns the address of the row. It uniquely identifies the row in the table (usually in the database also, but in this case there is an exception - if different tables store data in the same cluster they can have the same ROWID). The query below creates and inserts data into table in the Oracle database:

CREATE TABLE TableB (Value INT);
INSERT INTO TableB(Value) VALUES(1);
INSERT INTO TableB(Value) VALUES(2);
INSERT INTO TableB(Value) VALUES(3);
INSERT INTO TableB(Value) VALUES(4);
INSERT INTO TableB(Value) VALUES(5);
INSERT INTO TableB(Value) VALUES(5);
INSERT INTO TableB(Value) VALUES(3);
INSERT INTO TableB(Value) VALUES(5);

Now we are selecting the data and ROWID from the table:

SELECT ROWID, Value FROM TableB;

The result is below:


Find and Remove Duplicate Rows from a SQL Server Table

Now using ROWID we will easily remove duplicate rows from table:

DELETE TableB
WHERE rowid not in ( SELECT MAX(rowid) FROM TableB GROUP BY Value );

We can also remove duplicates using the code below:

DELETE from TableB o
WHERE rowid < ( SELECT MAX(rowid) FROM TableB i WHERE i.Value=o.Value GROUP BY Value ); Removing duplicates from a SQL Server table without unique index In SQL Server there is no equivalent to Oracle's ROWID, so to remove duplicates from the table without a unique index we need to do additional work for generating uniq

本文数据库(mssql)相关术语:熊片数据库 mssql数据库 oracle数据库 pubmed数据库 access数据库 万方数据库

主题: SQLSQL ServerTIRIMRY
分页:12
转载请注明
本文标题:Find and Remove Duplicate Rows from a SQL Server Table
本站链接:http://www.codesec.net/view/480649.html
分享请点击:


1.凡CodeSecTeam转载的文章,均出自其它媒体或其他官网介绍,目的在于传递更多的信息,并不代表本站赞同其观点和其真实性负责;
2.转载的文章仅代表原创作者观点,与本站无关。其原创性以及文中陈述文字和内容未经本站证实,本站对该文以及其中全部或者部分内容、文字的真实性、完整性、及时性,不作出任何保证或承若;
3.如本站转载稿涉及版权等问题,请作者及时联系本站,我们会及时处理。
登录后可拥有收藏文章、关注作者等权限...
技术大类 技术大类 | 数据库(mssql) | 评论(0) | 阅读(50)