Home > Software engineering >  Delete duplicate based on one column based on another column's value
Delete duplicate based on one column based on another column's value

Time:01-19

I have a table with data like

OrderId Code Val
1 Code1 abc
2 Code2 def
3 Code1 aaa
3 Code2 bbb

Expected

OrderId Code Val
1 Code1 abc
2 Code2 def
3 Code2 bbb

Now, I want to get the OrderId in such a way that, OrderId should have only one Code value. If there are more than one Code value for an OrderId then Code2 takes priority. So my final result should look like, OrderId 3 should have only one record with Code2 as value, how to query this in T-SQL. I tried with following query but not sure how to proceed after that

select OrderId, Code, count(*)
from Table1
group by OrderId, Code

CodePudding user response:

You can use concept of ROW_NUMBER() OVER (PARTITION BY OrderId ORDER BY OrderCode desc) row_num

e.g.

select *
from (
    select *
        , row_number() over (partition by orderId order by ordercode desc) row_num
    from #OrderTemp
) Orders
where row_num = 1

CodePudding user response:

The answer to your original question before you edited it was this simple query with MAX with GROUP BY:

SELECT 
    OrderId, 
    MAX(Code) AS Code
FROM yourtable
GROUP BY OrderId
ORDER BY OrderId;

If according to your new requirements further columns should be selected, we could use the above query as subquery with JOIN:

SELECT 
    y.OrderId, 
    y.Code, 
    y.Val
FROM yourtable y
INNER JOIN (
    SELECT 
        OrderId, 
        MAX(Code) AS Code
    FROM yourtable
    GROUP BY OrderId
) AS sub ON y.OrderId = sub.OrderId
    AND y.Code = sub.Code
ORDER BY y.OrderId;

But this becomes long and bad to read. Therefore using a window function should be prefered.

But there is another possible issue which should be solved:

We should be careful with this simple MAX or ROW_NUMBER ideas if lots of codes per OrderId can appear because the above query will for example fetch Code2, not Code10 as highest code if both appear. That's because it's a string, not a number.

I guess that's not intended. We can fix this issue by finding out the highest number after the word code. So we could do something like this here, using SUBSTRING and ROW_NUMBER:

SELECT orderId, code, val
FROM (
  SELECT 
    orderId, code, val,
    ROW_NUMBER() OVER 
      (PARTITION BY orderId 
      ORDER BY CAST(SUBSTRING(Code,5,LEN(code)-4) AS INT) DESC) row_num
    FROM yourtable
) Orders
WHERE row_num = 1;

Thus, Code10 will be taken rather than Code2.

We can replicate these things here: db<>fiddle

The idea with the longer query is also shown in the fiddle, but as said this is unhandy and bad to read, so I don't recommend it.

  • Related