Deduplication Logic to pick specific set of rows

Time:07-08

I have a problem where I need to pick rows based on specific logic. For example, this is what I have -

ID Amount Item
1 15 A
1 15 A
1 28 C
1 28 C
1 -15 B
2 30 D
2 -30 E
2 -30 E
2 18 F

First, I want to remove duplicate rows. This would get me -

ID Amount Item
1 15 A
1 28 C
1 -15 B
2 30 D
2 -30 E
2 18 F

Now I want to pick the rows, along with the right "Item", whose amounts don't cancel each other out. For ID 1, 15 (A) and -15 (B) cancel; for ID 2, 30 (D) and -30 (E) cancel. So the output should be -

ID Amount Item
1 28 C
2 18 F

The logic I'm thinking of is to use SUM OVER partitioned by ID, and then join back to the table on the same ID and the summed amount. For ID 1 the summed amount would be 28, and for ID 2 it would be 18. I also need to get Item C for ID 1 and Item F for ID 2. Does that make sense, or does someone have a better idea? Also, how would I code this in SQL?

This should create the initial table -

CREATE TABLE table_1 (
    Id VARCHAR,
    Amount NUMERIC,
    Item VARCHAR
);

INSERT INTO table_1 (Id, Amount, Item)
VALUES 
('1', 15, 'A'),
('1', 15, 'A'),
('1', 28, 'C'),
('1', 28, 'C'),
('1', -15, 'B'),
('2', 30, 'D'),
('2', -30, 'E'),
('2', -30, 'E'),
('2', 18, 'F');

CodePudding user response:

There might be an easier way, but following your logic you can use row_number to identify duplicate rows and filter them out. Then you can use a windowed sum to identify rows which cancel and filter them out.

However, I am sceptical about your logic: how can you simply remove duplicate rows like that? Maybe you have oversimplified the data? If your actual data can legitimately contain a repeated value that shouldn't be de-duplicated or combined, then the following logic would need to be improved to handle that.

with cte1 as (
    select *
        -- Use row number to remove duplicate rows
        , row_number() over (partition by Id, Amount, Item order by Id, Amount, Item) rn
    from table_1
), cte2 as (
    select Id, Item
       -- Use windowed sum to identify rows which cancel out
       , sum(Amount) over (partition by Id, abs(Amount)) CancelSum
    from cte1
    where rn = 1
)
select Id, Item, CancelSum Amount
from cte2
-- Filter out cancelled rows
where CancelSum <> 0;

Returns:

Id Item Amount
1 C 28
2 F 18
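
If repeated rows do turn out to be meaningful in your real data, one possible direction is to cancel amounts one-for-one instead of de-duplicating first. This is only a rough sketch under that assumption, not tested against your actual data. It uses table_1 as created in the question, and seq is just a helper column name made up for this example.

```sql
-- Sketch: pair the n-th +x with the n-th -x within each Id.
-- Assumes table_1 from the question; seq is a hypothetical helper column.
with numbered as (
    select Id, Amount, Item
        -- Number the occurrences of each distinct amount within an Id
        , row_number() over (partition by Id, Amount order by Item) seq
    from table_1
)
select n.Id, n.Item, n.Amount
from numbered n
-- Keep a row only if no opposite amount with the same sequence number exists.
-- Note: a row with Amount = 0 matches itself here and is therefore dropped.
where not exists (
    select 1
    from numbered o
    where o.Id = n.Id
      and o.Amount = -n.Amount
      and o.seq = n.seq
);
```

On the sample data this deliberately keeps the second 15/A row and both 28/C rows, so it only makes sense if those duplicates are genuine records rather than noise.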

CodePudding user response:

You can do your steps in multiple CTEs and use them to find the rows you want:

WITH CTE AS (
    -- Remove duplicate rows
    SELECT DISTINCT Id, Amount, Item
    FROM table_1
), CTE2 AS (
    -- Net amount per Id, skipping Ids that cancel out completely
    SELECT Id, SUM(Amount) Amount
    FROM CTE
    GROUP BY Id
    HAVING SUM(Amount) <> 0
)
-- Keep the row whose amount equals its Id's net amount
SELECT Id, Amount, Item
FROM CTE
WHERE EXISTS (
    SELECT 1
    FROM CTE2
    WHERE CTE.Id = CTE2.Id
    AND CTE.Amount = CTE2.Amount
);

Returns:

Id Amount Item
1 28 C
2 18 F

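The same idea can also be written with the windowed SUM mentioned in the question, which avoids the second lookup. This is a minimal sketch that assumes table_1 as created in the question; Deduped, Netted and NetAmount are names introduced here for illustration only.

```sql
-- Sketch of the SUM ... OVER approach described in the question.
-- Assumes table_1 from the question; the CTE and alias names are illustrative.
WITH Deduped AS (
    -- Step 1: drop exact duplicate rows
    SELECT DISTINCT Id, Amount, Item
    FROM table_1
), Netted AS (
    -- Step 2: attach each Id's net amount to every one of its rows
    SELECT Id, Amount, Item,
           SUM(Amount) OVER (PARTITION BY Id) AS NetAmount
    FROM Deduped
)
-- Step 3: keep the row whose own amount equals the non-zero net amount
SELECT Id, Amount, Item
FROM Netted
WHERE Amount = NetAmount
  AND NetAmount <> 0;
```

Like the query above, this only finds a row when exactly one amount per Id survives the cancellation. If two or more rows remain uncancelled for an Id, their sum will generally not equal any single row's amount and that Id returns nothing.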
