Have a question about pandas:
I have two dataframes:
df1 = pd.DataFrame({'user_id': ['12', '22', '33', '44'],
'time': ['t1', 't2', 't3', 't4'],
'data': [{'av': '8.0', 'si': 3, 'am' : 2}, {'av': '8.0', 'si': 44}, {'av': '8.0', 'si': 1}, {'av': '8.0', 'si': 22}]})
df2 = pd.DataFrame({'user_id': ['11', '22', '33', '44'],
'time': ['t1', 't2', 't3', 't4'],
'data': [{'cv': 'ff', 'si': 3}, {'cv': 'ff', 'si': 44}, {'cv': 'fa', 'si': 2}, {'cv': 'ff', 'si': 21}]})
And I need to filter df1 to reject rows contains values of 'user_id' and ['data'].'si' the same as 'user_id' and ['data'].'si' from df2 rows. If I'll do:
filter1 = df1['data'].str['si'].isin(df2['data'].str['si'])
filter2 = df1['user_id'].isin(df2['user_id'])
df3= df1[filter1 & filter2]
Result won't be valid, cause I need to reject exactly rows, where values satisfices both conditions: As example row 2 from df1
user_id time data
22 t2 'av': '8.0', 'si': 44
and from df2:
user_id time data
22 t2 'cv': 'ff', 'si': 44
Thanks a lot for any help!
CodePudding user response:
Having data normalized should ease this and every potential comparison much easier and cleaner.
Normalize Apply clean conditions:
df1 = pd.concat([df1.drop(columns=['data']), pd.json_normalize(df1.data)], axis=1)
df2 = pd.concat([df2.drop(columns=['data']), pd.json_normalize(df2.data)], axis=1)
Now dataframes look as follows:
df1:
user_id | time | av | si | am | |
---|---|---|---|---|---|
0 | 12 | t1 | 8 | 3 | 2 |
1 | 22 | t2 | 8 | 44 | nan |
2 | 33 | t3 | 8 | 1 | nan |
3 | 44 | t4 | 8 | 22 | nan |
df2:
user_id | time | cv | si | |
---|---|---|---|---|
0 | 11 | t1 | ff | 3 |
1 | 22 | t2 | ff | 44 |
2 | 33 | t3 | fa | 2 |
3 | 44 | t4 | ff | 21 |
Now you can have them merge and use indicator as follows
df1_filtered = df1.merge(df2[['user_id', 'si']], how='outer', indicator=True)
df1_filtered = df1_filtered[df1_filtered._merge.eq('left_only')].drop(columns=['_merge'])
df1_filtered
user_id | time | av | si | am | |
---|---|---|---|---|---|
0 | 12 | t1 | 8 | 3 | 2 |
2 | 33 | t3 | 8 | 1 | nan |
3 | 44 | t4 | 8 | 22 | nan |
df2_filtered = df2.merge(df1[['user_id', 'si']], how='outer', indicator=True)
df2_filtered = df2_filtered[df2_filtered._merge.eq('left_only')].drop(columns=['_merge'])
print(df2_filtered.to_markdown())
user_id | time | cv | si | |
---|---|---|---|---|
0 | 11 | t1 | ff | 3 |
2 | 33 | t3 | fa | 2 |
3 | 44 | t4 | ff | 21 |