I have a large df with many columns and rows, with usually two rows per certain identifier as df is used for reconciliation. Is there any way to streamline identification of non-identifier columns which cause mismatch?
import pandas as pd
df = pd.DataFrame({'col_1': ['A', 'B', 'C', 'B', 'C', 'D', 'E'],
'identifier': [ 1, 2, 3, 2, 3, 4, 4],
'col_3': [ 10, 20, 30, 21, 31, 40, 41],
'col_4': [ 1, 1, 1, 1, 1, 1, 1]
})
In above df, it would be
- col_1 for identifier 4 (D vs. E)
- col_3 for identifier 2/3/4 (20 vs. 21, 30 vs. 31, 40 vs. 41)
Open to any representation that makes it easy to isolate the columns causing mismatch, their values and identifiers.
CodePudding user response:
IIUC, you can agregate the columns as sets and keep those with more than one element:
s = df.groupby('identifier').agg(set).stack()
out = s[s.str.len().gt(1)]
output:
identifier
2 col_3 {20, 21}
3 col_3 {30, 31}
4 col_1 {D, E}
col_3 {40, 41}
dtype: object
further aggregation:
out.reset_index(level=1)['level_1'].groupby(level=0).agg(list)
output:
identifier
2 [col_3]
3 [col_3]
4 [col_1, col_3]
Name: level_1, dtype: object
CodePudding user response:
mismatch = df.groupby('identifier').agg(set).applymap(lambda x: x if len(x) > 1 else np.nan)
col_1_mismatch = mismatch[['col_1']].dropna()
col_3_mismatch = mismatch[['col_3']].dropna()
print(col_1_mismatch)
print(col_3_mismatch)
Output:
col_1
identifier
4 {D, E}
col_3
identifier
2 {20, 21}
3 {30, 31}
4 {40, 41}