Find mismatch in all columns between two rows linked by another column in dataframe-CodePudding

I have a large df with many columns and rows, with usually two rows per certain identifier as df is used for reconciliation. Is there any way to streamline identification of non-identifier columns which cause mismatch?

import pandas as pd

df = pd.DataFrame({'col_1':       ['A', 'B', 'C', 'B', 'C', 'D', 'E'],
                    'identifier': [  1,   2,   3,   2,   3,   4,   4],
                    'col_3':      [ 10,  20,   30,  21, 31,  40,  41],
                    'col_4':      [  1,   1,    1,   1,  1,   1,   1]
                    })

In above df, it would be

col_1 for identifier 4 (D vs. E)
col_3 for identifier 2/3/4 (20 vs. 21, 30 vs. 31, 40 vs. 41)

Open to any representation that makes it easy to isolate the columns causing mismatch, their values and identifiers.

CodePudding user response：

IIUC, you can agregate the columns as sets and keep those with more than one element:

s = df.groupby('identifier').agg(set).stack()
out = s[s.str.len().gt(1)]

output:

identifier       
2           col_3    {20, 21}
3           col_3    {30, 31}
4           col_1      {D, E}
            col_3    {40, 41}
dtype: object

further aggregation:

out.reset_index(level=1)['level_1'].groupby(level=0).agg(list)

output:

identifier
2           [col_3]
3           [col_3]
4    [col_1, col_3]
Name: level_1, dtype: object

CodePudding user response：

mismatch = df.groupby('identifier').agg(set).applymap(lambda x: x if len(x) > 1 else np.nan)
col_1_mismatch = mismatch[['col_1']].dropna()
col_3_mismatch = mismatch[['col_3']].dropna()
print(col_1_mismatch)
print(col_3_mismatch)

Output:

             col_1
identifier
4           {D, E}


               col_3
identifier
2           {20, 21}
3           {30, 31}
4           {40, 41}