In SQL, how to collapse two rows into one?-CodePudding

Let's say I have the following table:

C1	C2	C3	C4
Alton	James	Webs	AltonJamesWebs
Alton	Webs	Jams	AltonJamsWebs
Buddarakh	Izme	Grill	BuddarakhGrillIzme
Buddarakh	Gri	Izmezh	BuddarakhGriIzmezh

How would I collapse the table based on the Column C1 so that the result looks like the following:

C1	C2_1	C3_1	C4_1	C2_2	C3_2	C4_2
Alton	James	Webs	AltonJamesWebs	Webs	Jams	AltonJamsWebs
Buddarakh	Izme	Grill	BuddarakhGrillIzme	Gri	Izmezh	BuddarakhGriIzmezh

The ultimate purpose of this is to calculate the Levensthein Distance between the strings in columns C4_1 and C4_2.

Thanks!

CodePudding user response：

Converting your example data into DDL/DML:

DECLARE @Table TABLE (C1 NVARCHAR(20), C2 NVARCHAR(20), C3 NVARCHAR(20), C4 NVARCHAR(20));
INSERT INTO @Table (C1, C2, C3, C4) VALUES
('Alton     ', 'James   ', 'Webs    ', 'AltonJamesWebs    '),
('Alton     ', 'Webs    ', 'Jams    ', 'AltonJamsWebs     '),
('Buddarakh ', 'Izme    ', 'Grill   ', 'BuddarakhGrillIzme'),
('Buddarakh ', 'Gri     ', 'Izmezh  ', 'BuddarakhGriIzmezh'),
('Buddarakh ', 'Gric    ', 'Izmezh  ', 'BuddarakhGriIzmezh');

We can perform a self-join, but first we'll want to assign some row numbers so we can keep track of the rows later:

;WITH nowWithRowNumber AS (
SELECT t.C1, t.C2, t.C3, t.C4, ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY c2, c3, c4) AS rn
  FROM @Table t
)

SELECT t.C1, t.C2, t.c3, t.C4, t2.C2 AS C2_2, t2.C3 AS C3_2, t2.C4 AS C4_2, t2.rn
  FROM nowWithRowNumber t
    INNER JOIN nowWithRowNumber t2
      ON t.C1 = t2.C1
      AND t2.rn <> 1
      AND (
               t.c2 <> t2.c2
            OR t.c3 <> t2.c3
          ) 
 WHERE t.rn = 1

C1              C2          c3          C4                  C2_2        C3_2        C4_2                rn
----------------------------------------------------------------------------------------------------------
Alton           James       Webs        AltonJamesWebs      Webs        Jams        AltonJamsWebs       2
Buddarakh       Gri         Izmezh      BuddarakhGriIzmezh  Gric        Izmezh      BuddarakhGriIzmezh  2
Buddarakh       Gri         Izmezh      BuddarakhGriIzmezh  Izme        Grill       BuddarakhGrillIzme  3

This assumes the logic which you'll need to confirm or tune, that the rows should join based on the fact that the c1 columns match, but the others do not, and that the rows should be partition on c1 and sorted on c2, c3, c4.

CodePudding user response：

Trying to understand the wider context of your problem here, I think this is an x,y problem. In my experience when I have wanted to calculate the Levensthein Distance I have been attempting to find duplicate rows, and I've always wanted to do something with them once found. Pivoting them into columns actually makes any further processing very difficult. So I would approach this by keep the rows as they are, but matching them against the first duplicate in the C1 group found. This also handles as many potential duplicates as are found - although to be fair this is fairly simplistic logic.

DECLARE @Table TABLE (Id int, C1 nvarchar(20), C2 nvarchar(20), C3 nvarchar(20), C4 nvarchar(20));

INSERT INTO @Table (Id, C1, C2, C3, C4) VALUES
(1, 'Alton', 'James', 'Webs', 'AltonJamesWebs'),
(2, 'Alton', 'Webs', 'Jams', 'AltonJamsWebs'),
(3, 'Buddarakh', 'Izme', 'Grill', 'BuddarakhGrillIzme'),
(4, 'Buddarakh', 'Gri', 'Izmezh', 'BuddarakhGriIzmezh'),
(5, 'Buddarakh', 'Gric', 'Izmezh', 'BuddarakhGriIzmezh');

WITH cte1 AS (
    -- First find the row number within the C1 group
    SELECT *
        , ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY Id) rn 
    FROM @Table
), cte2 AS (
    -- Second using lag for all but the first row, lag back using rn to the
    -- first row in the C1 group
    SELECT *
        , CASE WHEN rn > 1 THEN LAG(Id, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseId
        , CASE WHEN rn > 1 THEN LAG(C2, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseC2
        , CASE WHEN rn > 1 THEN LAG(C3, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseC3
        , CASE WHEN rn > 1 THEN LAG(C4, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseC4
    FROM cte1
)
SELECT Id
    , C1, C2, C3, C4
    , baseId, baseC2, baseC3, baseC4
    -- Some function to calculate Levensthein Distance
    , dbo.LevenstheinDistance(baseC4, C4) LevenstheinDistance
FROM cte2;

This returns:

Id	C1	C2	C3	C4	baseId	baseC2	baseC3	baseC4
1	Alton	James	Webs	AltonJamesWebs	null	null	null	null
2	Alton	Webs	Jams	AltonJamsWebs	1	James	Webs	AltonJamesWebs
3	Buddarakh	Izme	Grill	BuddarakhGrillIzme	null	null	null	null
4	Buddarakh	Gri	Izmezh	BuddarakhGriIzmezh	3	Izme	Grill	BuddarakhGrillIzme
5	Buddarakh	Gric	Izmezh	BuddarakhGriIzmezh	3	Izme	Grill	BuddarakhGrillIzme

Which as you can see returns against each row (aside from the first in the group), the details of the first row in the group which can then be used to calculate the Levensthein Distance and then potentially to merge them because it knows which rows its being compared against.

DBFiddle

Thanks for the sample data Patrick Hurst.