How to list duplicates based on different criteria's T-SQL-CodePudding

I'm looking for someone to help me with a very specific task I have.

I'm analysing data from computer hard drives and need to be able to list folders which are duplicated after being extracted from .zip files. Here is an example of the data I am working with:

ItemName	Extension	ItemType
MyZipFolder.zip	.zip	File
MyZipFolder	null	Folder
PersonalDocuments.zip	.zip	File
PersonalDocuments	null	Folder

As you can see the extension '.zip' is included in the 'ItemName' and 'Extension' column. When extracted from a .zip file, it becomes a folder. I need a way of listing either the .zip file or the folder which it becomes after extraction (either will do, it just needs to be listed with the knowledge that it is a duplicate).

The caveat to this is that my data consists of plenty other folders and files with different extensions e.g. '.docx', '.msg' so the query needs to discount these.

I hope this makes sense - thanks!

Expected output might look something like this:

ItemName	Extension	ItemType
MyZipFolder	null	Folder
PersonalDocuments	null	Folder

So a list of all the folders which I know have a .zip equivalent in the data.

CodePudding user response：

Not sure yet, but do you mean something like this?

select *
from your_table y
where ItemType = 'Folder'
  and exists (
    select 1 from your_table yy
    where yy.Extension = '.zip'
      and replace(yy.ItemName,'.zip','') = y.ItemName
  )

CodePudding user response：

I think I got what you need :

select ItemName 
from tablename
group by replace(ItemName, isnull(Extension,''))
having case count(case when Extension = '.zip' then 1 end) > 1