Home > Back-end >  Remove duplicates based on two columns and reversed values
Remove duplicates based on two columns and reversed values

Time:05-11

Let's say we have following table:

Select 
    'a' as col1,
    'b' as col2
union
Select 
    'a' as col1,
    'c' as col2
union
Select 
    'b' as col1,
    'a' as col2

It gives:

col1 col2
"a" "b"
"a" "c"
"b" "a"

How would you define duplicated row assuming that rows like ("a", "b") and ("b", "a") are the same.

So in fact result should be:

col1 col2
"a" "b"
"a" "c"

CodePudding user response:

judging by your sample data and results you could use a query like below

select 
distinct
case when t2.col1 is not null and t2.col1<t1.col1 then t2.col1 else t1.col1 end as col1,
case when t2.col2 is not null and t2.col2>t1.col2 then t2.col2 else t1.col2 end as col2
from yourtable t1 
left join yourtable t2
on t1.col1=t2.col2 and t1.col2=t2.col1 

CodePudding user response:

In Databricks, I can do something as follows and it works. It seems you need to have id to select distinct records by ignoring field order.

CREATE TABLE table_a AS
SELECT uuid() as id, col1, col2 FROM your_table;

SELECT a1.col1, a2.col2
FROM table_a a1
WHERE NOT EXISTS (SELECT 1
                  FROM table_a a2
                  WHERE a1.col1 = a2.col2 
                        AND a1.col2 = a2.col1
                        AND a1.id > a2.id);
  • Related