Compare 2 list columns in pandas and find the diff-CodePudding

DataFrame

 df = pd.DataFrame({
    'Id': [1,1,1,1,2,2,3,4,4,4],
    'Col_1':['AD11','BZ23','CQ45','DL36','LM34','MM23','DL35','AD11','BP23','CQ45'],
    'Col_2':['AD11',nan,nan,'DL36',nan,nan,'DL35',nan,nan,'CQ45']]
    }, columns=['Id','Col_1','Col_2'])

Looks Like

Original data frame looks like this Please note that Col_1 & Col_2 has alpha numeric values and has more than one character. For eg : 'AD34' , 'EC45', etc.
After groupby and reset index

g = df.groupby('Id')['Col_1','Col_2'].agg(['unique'])
g= g.reset_index(drop=True)
g.columns = [''.join(col).strip() for col in g.columns.values]

I want to

store results that match in Match column
results that do not match No_match column

Result :

I tried to use some logic from this post but doesnt solve my issue.

Is there a better way to do the transformation for my requirement ?

Appreciate the help.

CodePudding user response：

First remove missing values from list and then use set.intersection and set.difference:

g = df.groupby('Id')[['Col_1','Col_2']].agg([lambda x: x.dropna().unique().tolist()])
g= g.reset_index(drop=True)
g.columns = [f'{a}_unique' for a, b in g.columns]

z = list(zip(g['Col_1_unique'], g['Col_2_unique']))
g['Match'] = [list(set(a).intersection(b)) for a, b in z]
g['No_Match'] = [list(set(a).difference(b)) for a, b in z]
print (g)
               Col_1_unique  Col_2_unique         Match      No_Match
0  [AD11, BZ23, CQ45, DL36]  [AD11, DL36]  [DL36, AD11]  [CQ45, BZ23]
1              [LM34, MM23]            []            []  [LM34, MM23]
2                    [DL35]        [DL35]        [DL35]            []
3        [AD11, BP23, CQ45]        [CQ45]        [CQ45]  [AD11, BP23]

CodePudding user response：

Here, my simple logic is to compare both list, by same value on same positions. Such as, [a,b,c] & [b,a,c] so match will be [c] only.

Code:

df = pd.DataFrame({
    'Id': [1,1,1,1,2,2,3,4,4,4],
    'Col_1':['A','B','C','D','L','M','D','A','B','C'],
    'Col_2':['A','','','D','','','D','', '', 'C']
    }, columns=['Id','Col_1','Col_2'])

#In order to compare list by values and position I needed to add unique value on null value 
#So the both list length would be same
df['Col_2'] = df.apply(lambda x : x.name if x.Col_2=='' else x.Col_2, axis=1)


g = df.groupby('Id')['Col_1','Col_2'].agg(['unique'])
g= g.reset_index(drop=True)
g.columns = [''.join(col).strip() for col in g.columns.values]

g['Match'] = g.apply(lambda x: [a for a, b in zip(x.Col_1unique, x.Col_2unique) if a==b], axis=1)
g['Not_Match'] = g.apply(lambda x: [a for a, b in zip(x.Col_1unique, x.Col_2unique) if a!=b], axis=1)
g

Output:

    Col_1unique     Col_2unique     Match   Not_Match
0   [A, B, C, D]    [A, 1, 2, D]    [A, D]  [B, C]
1   [L, M]          [4, 5]          []      [L, M]
2   [D]             [D]             [D]     []
3   [A, B, C]       [7, 8, C]       [C]     [A, B]

CodePudding user response：

Please try to use the below code but make it more efficient, for time being i tried the below,

import pandas as pd

df = pd.DataFrame({
    'Id': [1, 1, 1, 1, 2, 2, 3, 4, 4, 4],
    'Col_1': ['A', 'B', 'C', 'D', 'L', 'M', 'D', 'A', 'B', 'C'],
    'Col_2': ['A', 'nan', 'nan', 'D', 'nan', 'nan', 'D', 'nan', 'nan', 'C']})

print(df)

df['Match'] = ''

df['No-Match'] = ''

for i, row in df.iterrows():

    if row['Col_1'] == row['Col_2']:
        df.at[i, 'Match'] = row['Col_1']
    else:
        df.at[i, 'No-Match'] = row['Col_1']

print(df)

g = df.groupby('Id')['Id','Col_1','Col_2','Match','No-Match'].agg(['unique'])

g= g.reset_index(drop=True)

g.columns = [''.join(col).strip() for col in g.columns.values]
print(g)

Once you run this, you will get the below output:

  Idunique   Col_1unique  Col_2unique Matchunique No-Matchunique
0      [1]  [A, B, C, D]  [A, nan, D]      [A, D]         [B, C]
1      [2]        [L, M]        [nan]          []         [L, M]
2      [3]           [D]          [D]         [D]             []
3      [4]     [A, B, C]     [nan, C]         [C]         [A, B]