Home > Mobile >  In SQL, how to collapse two rows into one?
In SQL, how to collapse two rows into one?

Time:01-07

Let's say I have the following table:

C1 C2 C3 C4
Alton James Webs AltonJamesWebs
Alton Webs Jams AltonJamsWebs
Buddarakh Izme Grill BuddarakhGrillIzme
Buddarakh Gri Izmezh BuddarakhGriIzmezh

How would I collapse the table based on the Column C1 so that the result looks like the following:

C1 C2_1 C3_1 C4_1 C2_2 C3_2 C4_2
Alton James Webs AltonJamesWebs Webs Jams AltonJamsWebs
Buddarakh Izme Grill BuddarakhGrillIzme Gri Izmezh BuddarakhGriIzmezh

The ultimate purpose of this is to calculate the Levensthein Distance between the strings in columns C4_1 and C4_2.

Thanks!

CodePudding user response:

Converting your example data into DDL/DML:

DECLARE @Table TABLE (C1 NVARCHAR(20), C2 NVARCHAR(20), C3 NVARCHAR(20), C4 NVARCHAR(20));
INSERT INTO @Table (C1, C2, C3, C4) VALUES
('Alton     ', 'James   ', 'Webs    ', 'AltonJamesWebs    '),
('Alton     ', 'Webs    ', 'Jams    ', 'AltonJamsWebs     '),
('Buddarakh ', 'Izme    ', 'Grill   ', 'BuddarakhGrillIzme'),
('Buddarakh ', 'Gri     ', 'Izmezh  ', 'BuddarakhGriIzmezh'),
('Buddarakh ', 'Gric    ', 'Izmezh  ', 'BuddarakhGriIzmezh');

We can perform a self-join, but first we'll want to assign some row numbers so we can keep track of the rows later:

;WITH nowWithRowNumber AS (
SELECT t.C1, t.C2, t.C3, t.C4, ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY c2, c3, c4) AS rn
  FROM @Table t
)

SELECT t.C1, t.C2, t.c3, t.C4, t2.C2 AS C2_2, t2.C3 AS C3_2, t2.C4 AS C4_2, t2.rn
  FROM nowWithRowNumber t
    INNER JOIN nowWithRowNumber t2
      ON t.C1 = t2.C1
      AND t2.rn <> 1
      AND (
               t.c2 <> t2.c2
            OR t.c3 <> t2.c3
          ) 
 WHERE t.rn = 1
C1              C2          c3          C4                  C2_2        C3_2        C4_2                rn
----------------------------------------------------------------------------------------------------------
Alton           James       Webs        AltonJamesWebs      Webs        Jams        AltonJamsWebs       2
Buddarakh       Gri         Izmezh      BuddarakhGriIzmezh  Gric        Izmezh      BuddarakhGriIzmezh  2
Buddarakh       Gri         Izmezh      BuddarakhGriIzmezh  Izme        Grill       BuddarakhGrillIzme  3

This assumes the logic which you'll need to confirm or tune, that the rows should join based on the fact that the c1 columns match, but the others do not, and that the rows should be partition on c1 and sorted on c2, c3, c4.

CodePudding user response:

Trying to understand the wider context of your problem here, I think this is an x,y problem. In my experience when I have wanted to calculate the Levensthein Distance I have been attempting to find duplicate rows, and I've always wanted to do something with them once found. Pivoting them into columns actually makes any further processing very difficult. So I would approach this by keep the rows as they are, but matching them against the first duplicate in the C1 group found. This also handles as many potential duplicates as are found - although to be fair this is fairly simplistic logic.

DECLARE @Table TABLE (Id int, C1 nvarchar(20), C2 nvarchar(20), C3 nvarchar(20), C4 nvarchar(20));

INSERT INTO @Table (Id, C1, C2, C3, C4) VALUES
(1, 'Alton', 'James', 'Webs', 'AltonJamesWebs'),
(2, 'Alton', 'Webs', 'Jams', 'AltonJamsWebs'),
(3, 'Buddarakh', 'Izme', 'Grill', 'BuddarakhGrillIzme'),
(4, 'Buddarakh', 'Gri', 'Izmezh', 'BuddarakhGriIzmezh'),
(5, 'Buddarakh', 'Gric', 'Izmezh', 'BuddarakhGriIzmezh');

WITH cte1 AS (
    -- First find the row number within the C1 group
    SELECT *
        , ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY Id) rn 
    FROM @Table
), cte2 AS (
    -- Second using lag for all but the first row, lag back using rn to the
    -- first row in the C1 group
    SELECT *
        , CASE WHEN rn > 1 THEN LAG(Id, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseId
        , CASE WHEN rn > 1 THEN LAG(C2, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseC2
        , CASE WHEN rn > 1 THEN LAG(C3, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseC3
        , CASE WHEN rn > 1 THEN LAG(C4, rn-1, null) OVER (PARTITION BY C1 ORDER BY Id) ELSE NULL END baseC4
    FROM cte1
)
SELECT Id
    , C1, C2, C3, C4
    , baseId, baseC2, baseC3, baseC4
    -- Some function to calculate Levensthein Distance
    , dbo.LevenstheinDistance(baseC4, C4) LevenstheinDistance
FROM cte2;

This returns:

Id C1 C2 C3 C4 baseId baseC2 baseC3 baseC4
1 Alton James Webs AltonJamesWebs null null null null
2 Alton Webs Jams AltonJamsWebs 1 James Webs AltonJamesWebs
3 Buddarakh Izme Grill BuddarakhGrillIzme null null null null
4 Buddarakh Gri Izmezh BuddarakhGriIzmezh 3 Izme Grill BuddarakhGrillIzme
5 Buddarakh Gric Izmezh BuddarakhGriIzmezh 3 Izme Grill BuddarakhGrillIzme

Which as you can see returns against each row (aside from the first in the group), the details of the first row in the group which can then be used to calculate the Levensthein Distance and then potentially to merge them because it knows which rows its being compared against.

DBFiddle

Thanks for the sample data Patrick Hurst.

  • Related