How to highlight differences between rows per key, per column in SQL?-CodePudding

Say I have a dataset that I know has a number of rows where the unique keys are the same but data on some other columns is different. I don't know which rows to choose in case of mismatch of data, so I will drop them all anyway, but I want to record this inconsistency before doing so. The format I have in mind is COL1: [KEY], on a per row basis, so that that I can later output a summary both on a column and a key level.

Using my limited knowledge of SQL I came up with this (SQL_Fiddle):

CREATE TABLE MyTable (
    ID1 INTEGER,
    ID2 INTEGER,
    COL1 TEXT,
    COL2 TEXT
);

#DUP ON KEY, MISMATCH ON COL1
INSERT INTO MyTable (ID1, ID2, COL1, COL2) VALUES (1, 2, 'A', 'A');
INSERT INTO MyTable (ID1, ID2, COL1, COL2) VALUES (1, 2, 'B', 'A');

#DUP ON KEY, MISMATCH ON (COL2, COL1, COL1 COL2)
INSERT INTO MyTable (ID1, ID2, COL1, COL2) VALUES (2, 3, 'C', 'C');
INSERT INTO MyTable (ID1, ID2, COL1, COL2) VALUES (2, 3, 'C', 'D');
INSERT INTO MyTable (ID1, ID2, COL1, COL2) VALUES (2, 3, 'X', 'C');
INSERT INTO MyTable (ID1, ID2, COL1, COL2) VALUES (2, 3, 'Y', 'Z');

#NORMAL ROWS, WOULD LIKE TO KEEP
INSERT INTO MyTable (ID1, ID2, COL1, COL2) VALUES (4, 5, 'G', 'J');
INSERT INTO MyTable (ID1, ID2, COL1, COL2) VALUES (6, 7, 'H', 'Y');

SELECT T1.ID1, T1.ID2, T1.COL1, T1.COL2,
CASE
WHEN T1.COL1 <> T2.COL1 THEN CONCAT('COL1: ', T1.ID1, ' ', T1.ID2)
WHEN T1.COL2 <> T2.COL2 THEN CONCAT('COL2: ', T1.ID1, ' ', T1.ID2) END AS DIF_ON_KEY
#WHEN T1.COL3 <> T2.COL3...TO BE CONTINUED WITH PYTHON
from MyTable T1
INNER JOIN MyTable T2
ON T1.ID1 = T2.ID1
AND T1.ID2 = T2.ID2
GROUP BY ID1, ID2, COL1, COL2

The query will be generated dynamically using python, so I realize it will get huge with larger tables. I'm wondering if there's a more efficient/elegant way of doing this, provided I can parse the keys and columns before query creation. Currently this does skip the first row when labeling, so I'd have to add 1 to the count if I want a count of data-issue per column. Also multi-column inconsistencies aren't reflected too well, e.g key (2, 3) with values ('Y', 'Z') only shows as different on col1, not both. Also, this will run on Snowflake, not sure if there's any Snowflake-specific function that can help me out here. I'm guessing there has to be a less cumbersome way to do this, thanks in advance for any advice.

CodePudding user response：

Since you'll be using this with an arbitrary number of columns, I'd lean toward dynamic construction in a Snowflake query. There's only one function that does that and preserves the column names, and that's object_construct. You can use object_construct to create objects from the rows (as long as the overall size including column names is less than 16mb). From there you can use a Javascript UDF to do an object diff:

-- Note: This only works for flat objects with the same keys in both objects
create or replace function OBJECT_DIFF("obj1" object, "obj2" object, "tag1" string, "tag2" string)
returns object
language javascript
as
$$

var objOut = {};
var diffObj = {};

for ( var i in obj1 )
{
     if(obj1[i] !== obj2[i]) {
        diffObj = {};
        diffObj[tag1] = obj1[i];
        diffObj[tag2] = obj2[i];
        objOut[i] = diffObj;
     }
}
  return objOut;
$$;


-- Construct objects:
with KEY_COUNTS as
(
select ID1, ID2, count(*) as ROW_COUNT from MYTABLE group by ID1, ID2 having ROW_COUNT > 1
), DUPES as 
(
select T.* from KEY_COUNTS K left join MYTABLE T on T.ID1 = K.ID1 and T.ID2 = K.ID2
)
select T1.ID1, T2.ID2, object_construct(T1.*) T1, object_construct(T2.*) T2, object_diff(T1, T2, 'T1', 'T2') as column_diffs
from DUPES T1 join DUPES T2 on T1.ID1 = T2.ID1 and T1.ID2 = T2.ID2
;