Home > database >  MySQL GROUP BY all columns except of one
MySQL GROUP BY all columns except of one

Time:10-11

I need to compare two tables with exactly the same fields/types and identify all records which are different in at least one field or do only exist in one of the two tables.

Both tables have an column id with a unique value for each record. This would allow to identify corresponding records in both tables. But it does not mean, that those records are identical (any of the other columns can be different between the two records).

table0:

| id | col1 | col2 | ... |
| -- | ---- | ---- | --- |
| 22 | 1111 | 'gt' | ... |
| 23 | 5624 | 'ha' | ... |
| 24 | 7775 | 'oh' | ... |
| 26 | 2113 | 'yh' | ... |
| 28 | 9988 | 'wq' | ... |

table1:

| id | col1 | col2 | ... |
| -- | ---- | ---- | --- |
| 22 | 1111 | 'gt' | ... |
| 23 | 5624 | 'ha' | ... |
| 25 | 3333 | 'er' | ... |
| 26 | 2113 | 'ya' | ... |
| 28 | 9988 | 'wq' | ... |

What I need is the following result/output:

| id | reason |
| -- | ------ |
| 24 | only in table0 |
| 25 | only in table1 |
| 26 | not identical values |

A simple way would obviously be

SELECT *
    FROM
    (
        SELECT *, 0 AS src /* added src to identify table 0 */
        FROM table0
        UNION ALL
        SELECT *, 1 AS src /* added src to identify table 1 */
        FROM table1
    )  temp
    GROUP BY col1, col2, col3, ...
    HAVING COUNT(*) = 1

But there are two problems with this solution:

  1. I do not know the full set of columns at the time of writing this code. The table specification does change over time (the two tables are created by reading csv files, though the two tables to be compared are always identical). Is there a way to group by something like '*'?

  2. If grouping is indeed possible in a way I described above, I would need to exclude the src value I have added to identify the table a resulting record comes from.

CodePudding user response:

Your last edit makes it such a different question that it may be worth raising a new question and deleting this one that way some of the experienced contributors may revisit it.

Another approach would be to hash all the columns and compare the has values , you will still need to build the query from information_schema.columns but at least you don't need to match every column individually. Here's an example

    DROP TABLE IF EXISTS T,T1;
    
    CREATE TABLE T
    ( id INT, col1 INT, col2 VARCHAR(2));
    INSERT INTO T VALUES
    ( 22 , 1111 , 'gt'  ),
    ( 23 , 5624 , 'ha'  ),
    ( 24 , 7775 , 'oh'  ),
    ( 26 , 2113 , 'yh'  ),
    ( 28 , 9988 , 'wq'  );
    
    CREATE TABLE T1
    ( id INT, col1 INT, col2 VARCHAR(2));
    INSERT INTO T1 VALUES
    ( 22 , 1111 , 'gt'  ),
    ( 23 , 5624 , 'ha'  ),
    ( 25 , 3333 , 'er'  ),
    ( 26 , 2113 , 'ya'  ),
    ( 28 , 9988 , 'wq'  );
    
    SELECT T.ID,T.COL1,T.COL2,MD5(CONCAT(T.COL1,T.COL2)) MD5T,
                    T1.COL1,T1.COL2,MD5(CONCAT(T1.COL1,T1.COL2)) MD5T1
    FROM T
    LEFT JOIN T1 ON T1.ID = T.ID
    HAVING MD5T <> MD5T1 OR MD5T1 IS NULL
    UNION 
    SELECT T.ID,T.COL1,T.COL2,MD5(CONCAT(T.COL1,T.COL2)) MD5T,
                    T1.COL1,T1.COL2,MD5(CONCAT(T1.COL1,T1.COL2)) MD5T1
    FROM T1
    LEFT JOIN T ON T1.ID = T.ID
    WHERE T.ID IS NULL
    ;

 ------ ------ ------ ---------------------------------- ------ ------ ---------------------------------- 
| ID   | COL1 | COL2 | MD5T                             | COL1 | COL2 | MD5T1                            |
 ------ ------ ------ ---------------------------------- ------ ------ ---------------------------------- 
|   26 | 2113 | yh   | 0a67aa31e57d3b81e9a3adedcbf52926 | 2113 | ya   | bf963ac4282e4f16339a5f79d5042dcb |
|   24 | 7775 | oh   | fde0959e35d89f4ad2b0f8c0b89e4b20 | NULL | NULL | NULL                             |
| NULL | NULL | NULL | NULL                             | 3333 | er   | eb773ea6e5dd316a4e461ce4b52b1706 |
 ------ ------ ------ ---------------------------------- ------ ------ ---------------------------------- 
3 rows in set (0.015 sec)

NB I don't know how this performs over a lot of columns but I guess there's scope for chunking the hash values.

CodePudding user response:

if you know the source and can sum a value of 1 using a window function then by definition if the result is 1 then the missing source is obvious. for example

DROP TABLE IF EXISTS T,T1,t2;

create table t(id int);

create table t1(id int);

insert into t values (1),(2),(3);

insert into t1 values (1),(4),(3);

create table t2 as
select *,'t' as src from t
union all
select *,'t1' from t1;

with cte as
(
select *,sum(1) over (partition by id) sumrows 
from t2
)
select *,
        case when src = 't' then 't1'
        else 't'
        end as 'missing' 
from cte
where sumrows = 1;

 ------ ----- --------- --------- 
| id   | src | sumrows | missing |
 ------ ----- --------- --------- 
|    2 | t   |       1 | t1      |
|    4 | t1  |       1 | t       |
 ------ ----- --------- --------- 
2 rows in set (0.013 sec)

in your case since you don't know all the columns you would need to use dynamic sql.

NB I have no idea how this would perform with a lot of columns and a lot of data and for illustration purposes I haven't attempted to optimise this process.

Also 'different in at least one field' isn't a goer but it may be if the result of the above process skins down the result set to a reasonable size.

  • Related