DataFrame
df = pd.DataFrame({
'Id': [1,1,1,1,2,2,3,4,4,4],
'Col_1':['AD11','BZ23','CQ45','DL36','LM34','MM23','DL35','AD11','BP23','CQ45'],
'Col_2':['AD11',nan,nan,'DL36',nan,nan,'DL35',nan,nan,'CQ45']]
}, columns=['Id','Col_1','Col_2'])
Looks Like
Original data frame looks like this
Please note that Col_1 & Col_2 has alpha numeric values and has more than one character. For eg : 'AD34' , 'EC45', etc.
After groupby and reset index
g = df.groupby('Id')['Col_1','Col_2'].agg(['unique'])
g= g.reset_index(drop=True)
g.columns = [''.join(col).strip() for col in g.columns.values]
I want to
- store results that match in Match column
- results that do not match No_match column
Result :
I tried to use some logic from this post but doesnt solve my issue.
Is there a better way to do the transformation for my requirement ?
Appreciate the help.
CodePudding user response:
First remove missing values from list and then use set.intersection
and set.difference
:
g = df.groupby('Id')[['Col_1','Col_2']].agg([lambda x: x.dropna().unique().tolist()])
g= g.reset_index(drop=True)
g.columns = [f'{a}_unique' for a, b in g.columns]
z = list(zip(g['Col_1_unique'], g['Col_2_unique']))
g['Match'] = [list(set(a).intersection(b)) for a, b in z]
g['No_Match'] = [list(set(a).difference(b)) for a, b in z]
print (g)
Col_1_unique Col_2_unique Match No_Match
0 [AD11, BZ23, CQ45, DL36] [AD11, DL36] [DL36, AD11] [CQ45, BZ23]
1 [LM34, MM23] [] [] [LM34, MM23]
2 [DL35] [DL35] [DL35] []
3 [AD11, BP23, CQ45] [CQ45] [CQ45] [AD11, BP23]
CodePudding user response:
Here, my simple logic is to compare both list, by same value on same positions.
Such as, [a,b,c]
& [b,a,c]
so match will be [c]
only.
Code:
df = pd.DataFrame({
'Id': [1,1,1,1,2,2,3,4,4,4],
'Col_1':['A','B','C','D','L','M','D','A','B','C'],
'Col_2':['A','','','D','','','D','', '', 'C']
}, columns=['Id','Col_1','Col_2'])
#In order to compare list by values and position I needed to add unique value on null value
#So the both list length would be same
df['Col_2'] = df.apply(lambda x : x.name if x.Col_2=='' else x.Col_2, axis=1)
g = df.groupby('Id')['Col_1','Col_2'].agg(['unique'])
g= g.reset_index(drop=True)
g.columns = [''.join(col).strip() for col in g.columns.values]
g['Match'] = g.apply(lambda x: [a for a, b in zip(x.Col_1unique, x.Col_2unique) if a==b], axis=1)
g['Not_Match'] = g.apply(lambda x: [a for a, b in zip(x.Col_1unique, x.Col_2unique) if a!=b], axis=1)
g
Output:
Col_1unique Col_2unique Match Not_Match
0 [A, B, C, D] [A, 1, 2, D] [A, D] [B, C]
1 [L, M] [4, 5] [] [L, M]
2 [D] [D] [D] []
3 [A, B, C] [7, 8, C] [C] [A, B]
CodePudding user response:
Please try to use the below code but make it more efficient, for time being i tried the below,
import pandas as pd
df = pd.DataFrame({
'Id': [1, 1, 1, 1, 2, 2, 3, 4, 4, 4],
'Col_1': ['A', 'B', 'C', 'D', 'L', 'M', 'D', 'A', 'B', 'C'],
'Col_2': ['A', 'nan', 'nan', 'D', 'nan', 'nan', 'D', 'nan', 'nan', 'C']})
print(df)
df['Match'] = ''
df['No-Match'] = ''
for i, row in df.iterrows():
if row['Col_1'] == row['Col_2']:
df.at[i, 'Match'] = row['Col_1']
else:
df.at[i, 'No-Match'] = row['Col_1']
print(df)
g = df.groupby('Id')['Id','Col_1','Col_2','Match','No-Match'].agg(['unique'])
g= g.reset_index(drop=True)
g.columns = [''.join(col).strip() for col in g.columns.values]
print(g)
Once you run this, you will get the below output:
Idunique Col_1unique Col_2unique Matchunique No-Matchunique
0 [1] [A, B, C, D] [A, nan, D] [A, D] [B, C]
1 [2] [L, M] [nan] [] [L, M]
2 [3] [D] [D] [D] []
3 [4] [A, B, C] [nan, C] [C] [A, B]