Let's say I have the following table:
C1 | C2 | C3 | C4 |
---|---|---|---|
Alton | James | Webs | AltonJamesWebs |
Alton | Webs | Jams | AltonJamsWebs |
Buddarakh | Izme | Grill | BuddarakhGrillIzme |
Buddarakh | Gri | Izmezh | BuddarakhGriIzmezh |
How would I collapse the table based on the Column C1 so that the result looks like the following:
C1 | C2_1 | C3_1 | C4_1 | C2_2 | C3_2 | C4_2 |
---|---|---|---|---|---|---|
Alton | James | Webs | AltonJamesWebs | Webs | Jams | AltonJamsWebs |
Buddarakh | Izme | Grill | BuddarakhGrillIzme | Gri | Izmezh | BuddarakhGriIzmezh |
The ultimate purpose of this is to calculate the Levensthein Distance between the strings in columns C4_1 and C4_2.
Thanks!
CodePudding user response:
Converting your example data into DDL/DML:
DECLARE @Table TABLE (C1 NVARCHAR(20), C2 NVARCHAR(20), C3 NVARCHAR(20), C4 NVARCHAR(20));
INSERT INTO @Table (C1, C2, C3, C4) VALUES
('Alton ', 'James ', 'Webs ', 'AltonJamesWebs '),
('Alton ', 'Webs ', 'Jams ', 'AltonJamsWebs '),
('Buddarakh ', 'Izme ', 'Grill ', 'BuddarakhGrillIzme'),
('Buddarakh ', 'Gri ', 'Izmezh ', 'BuddarakhGriIzmezh'),
('Buddarakh ', 'Gric ', 'Izmezh ', 'BuddarakhGriIzmezh');
We can perform a self-join, but first we'll want to assign some row numbers so we can keep track of the rows later:
;WITH nowWithRowNumber AS (
SELECT t.C1, t.C2, t.C3, t.C4, ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY c2, c3, c4) AS rn
FROM @Table t
)
SELECT t.C1, t.C2, t.c3, t.C4, t2.C2 AS C2_2, t2.C3 AS C3_2, t2.C4 AS C4_2, t2.rn
FROM nowWithRowNumber t
INNER JOIN nowWithRowNumber t2
ON t.C1 = t2.C1
AND t2.rn <> 1
AND (
t.c2 <> t2.c2
OR t.c3 <> t2.c3
)
WHERE t.rn = 1
C1 C2 c3 C4 C2_2 C3_2 C4_2 rn
----------------------------------------------------------------------------------------------------------
Alton James Webs AltonJamesWebs Webs Jams AltonJamsWebs 2
Buddarakh Gri Izmezh BuddarakhGriIzmezh Gric Izmezh BuddarakhGriIzmezh 2
Buddarakh Gri Izmezh BuddarakhGriIzmezh Izme Grill BuddarakhGrillIzme 3
This assumes the logic which you'll need to confirm or tune, that the rows should join based on the fact that the c1 columns match, but the others do not, and that the rows should be partition on c1 and sorted on c2, c3, c4.
CodePudding user response:
Trying to understand the wider context of your problem here, I think this is an x,y problem. In my experience when I have wanted to calculate the Levensthein Distance I have been attempting to find duplicate rows, and I've always wanted to do something with them once found. Pivoting them into columns actually makes any further processing very difficult. So I would approach this by keep the rows as they are, but matching them against the first duplicate in the C1 group found. This also handles as many potential duplicates as are found - although to be fair this is fairly simplistic logic.
DECLARE @Table TABLE (Id int, C1 nvarchar(20), C2 nvarchar(20), C3 nvarchar(20), C4 nvarchar(20));
INSERT INTO @Table (Id, C1, C2, C3, C4) VALUES
(1, 'Alton', 'James', 'Webs', 'AltonJamesWebs'),
(2, 'Alton', 'Webs', 'Jams', 'AltonJamsWebs'),
(3, 'Buddarakh', 'Izme', 'Grill', 'BuddarakhGrillIzme'),
(4, 'Buddarakh', 'Gri', 'Izmezh', 'BuddarakhGriIzmezh'),
(5, 'Buddarakh', 'Gric', 'Izmezh', 'BuddarakhGriIzmezh');
WITH cte1 AS (
-- First find the row number within the C1 group
SELECT *
, ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY Id) rn
FROM @Table
), cte2 AS (
-- Second using lag for all but the first row, lag back using rn to the
-- first row in the C1 group
SELECT *
, CASE WHEN rn > 1 THEN LAG(Id, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseId
, CASE WHEN rn > 1 THEN LAG(C2, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseC2
, CASE WHEN rn > 1 THEN LAG(C3, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseC3
, CASE WHEN rn > 1 THEN LAG(C4, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseC4
FROM cte1
)
SELECT Id
, C1, C2, C3, C4
, baseId, baseC2, baseC3, baseC4
-- Some function to calculate Levensthein Distance
, dbo.LevenstheinDistance(baseC4, C4) LevenstheinDistance
FROM cte2;
This returns:
Id | C1 | C2 | C3 | C4 | baseId | baseC2 | baseC3 | baseC4 |
---|---|---|---|---|---|---|---|---|
1 | Alton | James | Webs | AltonJamesWebs | null | null | null | null |
2 | Alton | Webs | Jams | AltonJamsWebs | 1 | James | Webs | AltonJamesWebs |
3 | Buddarakh | Izme | Grill | BuddarakhGrillIzme | null | null | null | null |
4 | Buddarakh | Gri | Izmezh | BuddarakhGriIzmezh | 3 | Izme | Grill | BuddarakhGrillIzme |
5 | Buddarakh | Gric | Izmezh | BuddarakhGriIzmezh | 3 | Izme | Grill | BuddarakhGrillIzme |
Which as you can see returns against each row (aside from the first in the group), the details of the first row in the group which can then be used to calculate the Levensthein Distance and then potentially to merge them because it knows which rows its being compared against.
Thanks for the sample data Patrick Hurst.