Group by a column and ignore whitespace and special characters?-CodePudding

I am trying to find records in my database (SQL Server) that have values in common. I am grouping them and counting them as below:

select count(ItemNum) as RecordCount, Vendor, PartNumber 
from Products
where (RecordCount > 2)
group by Vendor, PartNumber
order by RecordCount desc, Vendor, PartNumber

I want the result table to have all existing combinations of Vendor and PartNumber and look like RecordCount, Vendor, PartNumber

This works great, but sometimes there are minor differences in the values and I want to group them together despite those differences. Specifically I want to ignore whitespace and non-alphanumeric characters.

For example, PartNumber = "p120" should match with "p1_20" and "p1 20".

How can I do this?

CodePudding user response：

Try to use string functions to remove whitespaces. Here is list below:

LTRIM RTRIM REPLACE

These functions will remove / replace extra chars. For example:

-- PartNumber = " 120_1 "
SELECT REPLACE(LTRIM(RTRIM(PartNumber)), '_', '') FROM Products
-- PartNumber = "1201"

I created a table with test data:

INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (25, 'VendorName', '1025');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (14, 'VendorName', '1_ 025');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (25, 'VendorName', '102 5');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (25, 'VendorName', '102_5');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (25, 'VendorName', ' 1025 ');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (25, 'VendorName', '102 5');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (223, 'OtherVendorName', '    9_02');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (102, 'OtherVendorName', '902');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (255, 'OtherVendorName', '  902 ');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (31, 'OtherVendorName', '902_');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (33, 'OtherVendorName', '9 02');
INSERT INTO Products (ItemNum, Vendor, PartNumber) VALUES (55, 'OtherVendorName', '  902  ');

Complete query with data from the previous code block.

SELECT COUNT(ItemNum) AS RecordCount, Vendor, REPLACE(REPLACE(LTRIM(RTRIM(PartNumber)), '_', ''), ' ', '') AS PartNumber
FROM Products
GROUP BY Vendor, REPLACE(REPLACE(LTRIM(RTRIM(PartNumber)), '_', ''), ' ', '')
HAVING COUNT(ItemNum) > 2
ORDER BY COUNT(ItemNum) DESC, 2, 3

Returns

RecordCount	Vendor	PartNumber
6	OtherVendorName	902
6	VendorName	1025

CodePudding user response：

Assuming you are evaluating only numbers in those part numbers and letters are ignored...

SELECT COUNT(p2.PartNumber) RecordCount, p1.Vendor, p1.PartNumber 
FROM Products p1
LEFT JOIN Products p2 ON LEFT(SUBSTRING(p1.PartNumber, PATINDEX('%[0-9]%', p1.PartNumber), 4000),PATINDEX('%[^0-9]%', SUBSTRING(p1.PartNumber, PATINDEX('%[0-9]%', p1.PartNumber), 4000)))
= LEFT(SUBSTRING(p1.PartNumber, PATINDEX('%[0-9]%', p2.PartNumber), 4000),PATINDEX('%[^0-9]%', SUBSTRING(p2.PartNumber, PATINDEX('%[0-9]%', p2.PartNumber), 4000)))
GROUP BY p1.Vendor, p1.PartNumber,
HAVING COUNT(p2.PartNumber) > 2
ORDER BY COUNT(p2.PartNumber) DESC, Vendor, PartNumber

I just set the substring with a max for nvarchar since I don't know your data types.