Home > front end >  Select rows from pandas dataframe by two values at the same time from rows in another dataframe
Select rows from pandas dataframe by two values at the same time from rows in another dataframe

Time:05-22

Have a question about pandas:

I have two dataframes:

df1 = pd.DataFrame({'user_id': ['12', '22', '33', '44'],
                    'time': ['t1', 't2', 't3', 't4'],
                    'data': [{'av': '8.0', 'si': 3, 'am' : 2}, {'av': '8.0', 'si': 44}, {'av': '8.0', 'si': 1}, {'av': '8.0', 'si': 22}]})

df2 = pd.DataFrame({'user_id': ['11', '22', '33', '44'],
                    'time': ['t1', 't2', 't3', 't4'],
                    'data': [{'cv': 'ff', 'si': 3}, {'cv': 'ff', 'si': 44}, {'cv': 'fa', 'si': 2}, {'cv': 'ff', 'si': 21}]})

And I need to filter df1 to reject rows contains values of 'user_id' and ['data'].'si' the same as 'user_id' and ['data'].'si' from df2 rows. If I'll do:

filter1 = df1['data'].str['si'].isin(df2['data'].str['si'])
filter2 = df1['user_id'].isin(df2['user_id'])
df3= df1[filter1 & filter2]

Result won't be valid, cause I need to reject exactly rows, where values satisfices both conditions: As example row 2 from df1

user_id       time        data
  22           t2    'av': '8.0', 'si': 44

and from df2:

user_id       time        data
  22           t2    'cv': 'ff', 'si': 44

Thanks a lot for any help!

CodePudding user response:

Having data normalized should ease this and every potential comparison much easier and cleaner.

Normalize Apply clean conditions:

df1 = pd.concat([df1.drop(columns=['data']), pd.json_normalize(df1.data)], axis=1)
df2 = pd.concat([df2.drop(columns=['data']), pd.json_normalize(df2.data)], axis=1)

Now dataframes look as follows:

df1:

user_id time av si am
0 12 t1 8 3 2
1 22 t2 8 44 nan
2 33 t3 8 1 nan
3 44 t4 8 22 nan

df2:

user_id time cv si
0 11 t1 ff 3
1 22 t2 ff 44
2 33 t3 fa 2
3 44 t4 ff 21

Now you can have them merge and use indicator as follows

df1_filtered = df1.merge(df2[['user_id', 'si']], how='outer', indicator=True)
df1_filtered = df1_filtered[df1_filtered._merge.eq('left_only')].drop(columns=['_merge'])
df1_filtered
user_id time av si am
0 12 t1 8 3 2
2 33 t3 8 1 nan
3 44 t4 8 22 nan
df2_filtered = df2.merge(df1[['user_id', 'si']], how='outer', indicator=True)
df2_filtered = df2_filtered[df2_filtered._merge.eq('left_only')].drop(columns=['_merge'])
print(df2_filtered.to_markdown())
user_id time cv si
0 11 t1 ff 3
2 33 t3 fa 2
3 44 t4 ff 21
  • Related