I've this table below that includes ID
, and five indicator columns: x1
, ..., x5
:
I need to remove duplicates based on this logic:
For each ID
, we look at the values of x1
, ..., x5
, and we remove the ones that are subset of other row. For example, for ID=1, row #3 is a subset of row #2, so we remove row #3. Also, row #4 is NOT a subset of row #2, so we keep it.
Here is the expected final view of the table:
CodePudding user response:
First concatenate all values of the 5 columns so that you get a binary string like '100101' which can be converted to a base 10 number, say aliased value
, with the function CONV()
.
Assuming there are no duplicate rows in the table as you mention in your comment, a row #X should be considered a subset of another row #Y if the result of the bitwise OR
between the value
s of the 2 rows returns the value
of #Y:
SELECT t1.*
FROM tablename t1
WHERE NOT EXISTS (
SELECT *
FROM tablename t2
WHERE t2.ID = t1.ID
AND (t1.x1, t1.x2, t1.x3, t1.x4, t1.x5) <>
(t2.x1, t2.x2, t2.x3, t2.x4, t2.x5)
AND CONV(CONCAT(t1.x1, t1.x2, t1.x3, t1.x4, t1.x5), 2, 10) |
CONV(CONCAT(t2.x1, t2.x2, t2.x3, t2.x4, t2.x5), 2, 10) =
CONV(CONCAT(t2.x1, t2.x2, t2.x3, t2.x4, t2.x5), 2, 10)
);
Or, for MySql 8.0 :
WITH cte AS (
SELECT *, CONV(CONCAT(x1, x2, x3, x4, x5), 2, 10) value
FROM tablename
)
SELECT t1.ID, t1.x1, t1.x2, t1.x3, t1.x4, t1.x5
FROM cte t1
WHERE NOT EXISTS (
SELECT *
FROM cte t2
WHERE t2.ID = t1.ID
AND t2.value <> t1.value
AND t1.value | t2.value = t2.value
);
If you want to delete the subset rows, use a self join of the table like this:
DELETE t1
FROM tablename t1 INNER JOIN tablename t2
ON t2.ID = t1.ID
AND (t1.x1, t1.x2, t1.x3, t1.x4, t1.x5) <>
(t2.x1, t2.x2, t2.x3, t2.x4, t2.x5)
AND CONV(CONCAT(t1.x1, t1.x2, t1.x3, t1.x4, t1.x5), 2, 10) |
CONV(CONCAT(t2.x1, t2.x2, t2.x3, t2.x4, t2.x5), 2, 10) =
CONV(CONCAT(t2.x1, t2.x2, t2.x3, t2.x4, t2.x5), 2, 10);
See the demo.