I have a big table with over 1 million rows and 96 columns.
Using SQL, I want to find rows where every value is the same (i.e. fully duplicated rows). The table doesn't have any primary key, so I'm not sure how to approach this, and I'm not permitted to change the table structure.
I've seen people use COUNT(*) with GROUP BY, but I'm not sure whether that is efficient for a table with 96 columns.
CodePudding user response:
Using COUNT() as an analytic function, we can try:
WITH cte AS (
SELECT *, COUNT(*) OVER (PARTITION BY col1, col2, ..., col96) cnt
FROM yourTable
)
SELECT col1, col2, ..., col96
FROM cte
WHERE cnt > 1;
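If you want to sanity-check this pattern before pointing it at the million-row table, here is a small sketch using Python's sqlite3 module (SQLite has supported window functions since 3.25); the three-column table and its contents are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (col1 TEXT, col2 TEXT, col3 TEXT)")
conn.executemany("INSERT INTO yourTable VALUES (?, ?, ?)",
                 [("a", "b", "c"), ("a", "b", "c"), ("x", "y", "z")])

# COUNT(*) OVER (PARTITION BY every column) tags each row with the number
# of rows identical to it; cnt > 1 keeps only the duplicated ones.
dupes = conn.execute("""
    WITH cte AS (
        SELECT *, COUNT(*) OVER (PARTITION BY col1, col2, col3) AS cnt
        FROM yourTable
    )
    SELECT col1, col2, col3 FROM cte WHERE cnt > 1
""").fetchall()
print(dupes)  # both copies of the duplicated row
```

For the real table the PARTITION BY clause would list all 96 columns.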
CodePudding user response:
You can use the MD5 of the concatenated columns as a surrogate key:
select count(1), md5_col, *
from (
    select md5(concat_ws('', col1, col2 /* ...all 96 columns */)) as md5_col, *
    from db_name.table_name
) tt
group by md5_col;
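One caveat with concat_ws('', ...): with an empty separator, ('ab', 'c') and ('a', 'bc') concatenate to the same string and therefore get the same MD5. A sketch of the difference in Python (row_key is a hypothetical helper; \x1f is an assumed "safe" separator that must not occur in your data):

```python
import hashlib

def row_key(values, sep="\x1f"):
    # Join with a separator that cannot appear in the data; with sep=""
    # (like concat_ws('', ...)) different rows can produce the same hash.
    joined = sep.join("" if v is None else str(v) for v in values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

print(row_key(("ab", "c")) == row_key(("a", "bc")))           # distinct keys
print(row_key(("ab", "c"), sep="") == row_key(("a", "bc"), sep=""))  # collision
```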
CodePudding user response:
Rather than trying to boil the ocean and solve the entire problem with a single SQL query (which you certainly can do...), I recommend using any indexes or statistics on the table to filter out as many rows as you can.
Start by finding the columns with the most / fewest unique values (assuming you have statistics, that is), and smash them up against each other to rapidly exclude as many rows as possible. Take the results, dump them to a temp table, index fields as needed, and repeat.
Or you could just do this:
Declare @sql nvarchar(max);

Select @sql = 'select column1 from schema.table where case '
    + (select 'when col1 != ' + quotename(name) + ' then 0 '
       from sys.columns
       where object_id = object_id('schema.table')
         and name <> 'col1'  -- comparing col1 to itself is pointless
       for xml path(''), type).value('.', 'nvarchar(max)')
    + 'else 1 end = 1';

Exec sp_executesql @sql;
If you must run that horrorshow of a query in production, please use snapshot isolation or move the data to a temp table first (unless no one ever updates the table).
(Honestly, I would probably use something like that query on the temp table containing my filtered-down dataset... but anything you can do to make sure that the comparisons aren't naive (e.g. taking statistics into account) can improve your performance significantly. If you want to do it all at once, you could always join sys.tables to a temp table that puts your field comparisons into a thoughtful order. After all, once a CASE branch is found to be true, all the others will be skipped for that record.)
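For the duplicate-row reading of the question, the same catalog-driven trick applies: read the column list from the system catalog and build the GROUP BY dynamically instead of typing 96 names by hand. A minimal sketch in Python with sqlite3 (pragma table_info playing the role of sys.columns; the three-column table is made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col2 TEXT, col3 TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [("a", "b", "c"), ("a", "b", "c"), ("x", "y", "z")])

# Read the column list from the catalog (sys.columns in SQL Server,
# PRAGMA table_info in SQLite) rather than hard-coding 96 names.
cols = [row[1] for row in conn.execute("PRAGMA table_info(t)")]
col_list = ", ".join(f'"{c}"' for c in cols)

sql = (f"SELECT {col_list}, COUNT(*) AS cnt FROM t "
       f"GROUP BY {col_list} HAVING COUNT(*) > 1")
dup_groups = conn.execute(sql).fetchall()
print(dup_groups)  # [('a', 'b', 'c', 2)]
```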
CodePudding user response:
For convenience, use BINARY_CHECKSUM:
with cte as (
    select *, BINARY_CHECKSUM(*) as checksum
    from mytable
), cte2 as (
    select checksum
    from cte
    group by checksum
    having count(*) > 1
)
select distinct t1.*
from cte t1
join cte t2 on t1.checksum = t2.checksum
    and t1.col1 = t2.col1
    and t1.col2 = t2.col2
    -- etc. for all 96 columns
where t1.checksum in (select checksum from cte2);
cte2 returns only the checksums that occur more than once — (almost) only truly duplicated rows — so the column-by-column join conditions run against a small candidate set rather than exhaustively comparing every row. Note that BINARY_CHECKSUM can collide, which is why the full column comparison is still needed to weed out false matches.
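The two-stage idea — cheap checksum first, full comparison only among candidates — can be sketched outside the database too; here in Python, with the built-in hash() standing in for BINARY_CHECKSUM and a made-up three-column table:

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (col1 TEXT, col2 TEXT, col3 TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)",
                 [("a", "b", "c"), ("a", "b", "c"), ("x", "y", "z")])

rows = conn.execute("SELECT col1, col2, col3 FROM mytable").fetchall()

# Stage 1: cheap per-row checksum; keep only rows whose checksum repeats.
checksums = Counter(hash(r) for r in rows)
candidates = [r for r in rows if checksums[hash(r)] > 1]

# Stage 2: compare full rows, but only among the few candidates, so a
# checksum collision can never produce a false positive.
full_counts = Counter(candidates)
duplicates = [r for r, n in full_counts.items() if n > 1]
print(duplicates)  # [('a', 'b', 'c')]
```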