Home > front end >  SQL Finding Dupe Names with Distinct ID
SQL Finding Dupe Names with Distinct ID

Time:11-13

I have a list of customers, and a bunch of them are duplicates ('Acme Inc', 'Acme, Inc', 'Acme Inc.', 'Acme, Inc.') They all have different IDs. BUT, each ID also has multiple addresses. Something like...

 ------- --------------- ------------------- ----------- --- ------- 
|ID     |Name           |Address            |City       |St |Zip    |
 ------- --------------- ------------------- ----------- --- ------- 
|001    |Acme Inc       |123 Address St     |Columbus   |OH |43081  |
|001    |Acme Inc       |321 Street St      |Columbus   |OH |43081  |
|001    |Acme Inc       |456 Blanket Blvd   |Columbus   |OH |43081  |
|002    |Acme, Inc      |123 Babel St       |Columbus   |OH |43081  |
|002    |Acme, Inc      |321 Acorn Rd       |Columbus   |OH |43081  |
|002    |Acme, Inc      |456 Lancer Blvd    |Columbus   |OH |43081  |
|003    |Baker          |456 Blanket Blvd   |Columbus   |OH |43081  |
|004    |Peterson       |456 Blanket Blvd   |Columbus   |OH |43081  |
|005    |Plumbers Inc   |123 Address St     |Columbus   |OH |43081  |
|006    |Plumbers, LLC  |321 Street St      |Columbus   |OH |43081  |
|007    |Acme, Inc.     |123 Address St     |Columbus   |OH |43081  |

I have a function that normalizes the name so the first 6 would all be 'Acme' and the last two 'Plumbers'.

What I want is a list of the ID and Name where there are duplicate. The goal is a report of records with unique IDs and duplicate names.

 ------- --------------- 
|ID     |Name           |
 ------- --------------- 
|001    |Acme Inc       |
|002    |Acme, Inc      |
|007    |Acme, Inc.     |
|005    |Plumbers Inc   |
|006    |Plumbers, LLC  |

I tried this:

SELECT 
    DISTINCT [Name],
    ( SELECT strNew FROM [fn_strNorm](2, [Name]) ) AS [NewName] 
  FROM [Processed_Vendors] 
  WHERE 
    [VendorID] <> '' AND 
    [VendorID] IS NOT NULL AND 
    [Name]<> '' AND 
    [Name] IS NOT NULL 
  GROUP BY [NewName]
  HAVING COUNT(*) > 1 
ORDER BY [NewName]

I also tried to put them into [dump_names] table and join the two, but I keep getting multiple records from the same ID

SELECT 
    pv.[VendorID], 
    pv.[Name]
  FROM [dupe_names] n 
  LEFT JOIN [Processed_Vendors] pv 
    ON pv.[Name] = n.[Name] 
ORDER BY pv.[Name]


SELECT 
    'Name Match' AS [Reason], 
    pv.[VendorID], 
    pv.[Name]
  FROM [dupe_names] n 
  LEFT JOIN [Processed_Vendors] pv 
    ON pv.[Name] = n.[Name] 
    AND ( SELECT strNew FROM [dbo].[fn_strNorm](2, pv.[Name]) ) = n.[NewName]
ORDER BY pv.[Name]

I think I'm overthinking this or the migraine I'm sporting is clouding my thinking. Either way, I appreciate the help.

CodePudding user response:

One approach would use a CTE to find the distinct counts of name variants for each normalized name. Then, join to your current table and retain only records which had more than one name variant.

WITH cte AS (
    SELECT [dbo].[fn_strNorm](2, Name) AS NmName, COUNT(DISTINCT Name) AS cnt
    FROM Processed_Vendors
    GROUP BY [dbo].[fn_strNorm](2, Name)
)

SELECT DISTINCT pv.ID, pv.Name
FROM Processed_Vendors pv
INNER JOIN cte t
    ON t.NmName = [dbo].[fn_strNorm](2, pv.Name)
WHERE t.cnt > 1;
  • Related