SQL finding duplicate values in rows-CodePudding

I have table with 22 columns and multiple rows.

I want to get a new table that will contain only rows for which the values in the columns V1, V3, V4, V13 appear twice in the table in correlation with each other (these repetitions)

Works in Microsoft SQL Server Management Studio.

I've tried something like this

SELECT 
    V1, V3, V4, V13
FROM  
    table 
GROUP BY 
    V1, V3, V4, V13
HAVING 
    (COUNT (*) = 2)

CodePudding user response：

You can do it by first counting how many values you have for each "V1", "V3", "V4", "V13" fields with the COUNT window function. Then join back with your original table, though keeping those rows which have count bigger than 1.

WITH cte AS (
    SELECT <your_table_identifier_field(s)>, 
           COUNT(*) OVER(PARTITION BY V1, V3, V4, V13) as cnt 
    FROM table
)
SELECT * 
FROM       tab
INNER JOIN cte
        ON tab.<your_table_identifier_field(s)> = cte.<your_table_identifier_field(s)>
       AND cte.cnt > 1

EDIT: What if there's no specific row identifier? You're forced to match all column values then.

WITH cte AS (
    SELECT *, 
           COUNT(*) OVER(PARTITION BY V1, V3, V4, V13) as cnt 
    FROM table
)
SELECT * 
FROM       tab
INNER JOIN cte
        ON tab.<field1> = cte.<field1>
       AND tab.<field2> = cte.<field2>
       AND ...
       AND cte.cnt > 1

CodePudding user response：

Why over-complicate things? Just take your original query, put it in a CTE and then join the CTE to the original table.

with cte as (
    SELECT V1, V3, V4, V13 FROM dbo.table 
    GROUP BY V1, V3, V4, V13
    HAVING COUNT (*) > 1
)
select t1.* 
from dbo.table as t1 
inner join cte on t1.V1 = cte.V1 and t1.V2 = cte.V2 ...
order by ...
;

I've made some changes to incorporate best practices. Added the schema name (assumed dbo) to your table name, statement terminator, ORDER BY clause (because it usually matters). I changed the HAVING clause since you asked for duplicates which means the count is > 1. Using count = 2 means you are limiting the result to those rows where there are exactly two duplicates. That is a very unusual requirement but change it if needed. Normally I would not use "*" as the column list as it is generally far better to explicitly include the columns you need. I leave that to you as well.

Finally, consider changing your schema to prevent duplicates. That will avoid a lot of future work.