Home > database >  Remove subset rows of other rows in SQL
Remove subset rows of other rows in SQL

Time:04-12

I've this table below that includes ID, and five indicator columns: x1, ..., x5:

enter image description here

I need to remove duplicates based on this logic:

For each ID, we look at the values of x1, ..., x5, and we remove the ones that are subset of other row. For example, for ID=1, row #3 is a subset of row #2, so we remove row #3. Also, row #4 is NOT a subset of row #2, so we keep it.

Here is the expected final view of the table:

enter image description here

CodePudding user response:

First concatenate all values of the 5 columns so that you get a binary string like '100101' which can be converted to a base 10 number, say aliased value, with the function CONV().

Assuming there are no duplicate rows in the table as you mention in your comment, a row #X should be considered a subset of another row #Y if the result of the bitwise OR between the values of the 2 rows returns the value of #Y:

SELECT t1.*
FROM tablename t1
WHERE NOT EXISTS (
  SELECT *
  FROM tablename t2
  WHERE t2.ID = t1.ID 
    AND (t1.x1, t1.x2, t1.x3, t1.x4, t1.x5) <>
        (t2.x1, t2.x2, t2.x3, t2.x4, t2.x5)
    AND CONV(CONCAT(t1.x1, t1.x2, t1.x3, t1.x4, t1.x5), 2, 10) |
        CONV(CONCAT(t2.x1, t2.x2, t2.x3, t2.x4, t2.x5), 2, 10) =
        CONV(CONCAT(t2.x1, t2.x2, t2.x3, t2.x4, t2.x5), 2, 10)
);

Or, for MySql 8.0 :

WITH cte AS (
  SELECT *, CONV(CONCAT(x1, x2, x3, x4, x5), 2, 10) value
  FROM tablename
)
SELECT t1.ID, t1.x1, t1.x2, t1.x3, t1.x4, t1.x5
FROM cte t1
WHERE NOT EXISTS (
  SELECT *
  FROM cte t2
  WHERE t2.ID = t1.ID 
    AND t2.value <> t1.value
    AND t1.value | t2.value = t2.value
);

If you want to delete the subset rows, use a self join of the table like this:

DELETE t1
FROM tablename t1 INNER JOIN tablename t2
ON t2.ID = t1.ID
AND (t1.x1, t1.x2, t1.x3, t1.x4, t1.x5) <>
    (t2.x1, t2.x2, t2.x3, t2.x4, t2.x5)
AND CONV(CONCAT(t1.x1, t1.x2, t1.x3, t1.x4, t1.x5), 2, 10) |
    CONV(CONCAT(t2.x1, t2.x2, t2.x3, t2.x4, t2.x5), 2, 10) =
    CONV(CONCAT(t2.x1, t2.x2, t2.x3, t2.x4, t2.x5), 2, 10);

See the demo.

  • Related