I have the following DataFrame:
user category x y
0 AB A 1 1
1 EF A 1 1
2 SG A 1 0
3 MN A 1 0
4 AB B 0 0
5 EF B 0 1
6 SG B 0 1
7 MN B 0 0
8 AB C 1 1
9 EF C 1 1
10 SG C 1 1
11 MN C 1 1
I want to select users that have x=y
in all categories. I was able to do that using the following code:
data = pd.DataFrame({'user': ['AB', 'EF', 'SG', 'MN', 'AB', 'EF',
'SG', 'MN', 'AB', 'EF', 'SG', 'MN'],
'category': ['A', 'A', 'A', 'A', 'B', 'B',
'B', 'B', 'C', 'C', 'C', 'C'],
'x': [1,1,1,1, 0,0,0,0, 1,1,1,1],
'y': [1,1,0,0, 0,1,1,0, 1,1,1,1]})
data = data[data['x'] == data['y']][['user', 'category']]
count_users_match = data.groupby('user', as_index=False).count()
count_cat = data['category'].unique().shape[0]
print(count_users_match[count_users_match['category'] == count_cat])
Output:
user category
0 AB 3
I felt that this is a quite long solution. Is there any shorter way to achieve this?
CodePudding user response:
Try this:
filtered = df.x.eq(df.y).groupby(df['user']).sum().loc[lambda x: x == df['category'].nunique()].reset_index(name='category')
Output:
>>> filtered
user category
0 AB 3
CodePudding user response:
We could use query
groupby
size
to find the number of matching categories for each user. Then compare it with the number of categories for each user:
tmp = data.query('x==y').groupby('user').size()
out = tmp[tmp == data['category'].nunique()].reset_index(name='category')
Output:
user category
0 AB 3
CodePudding user response:
This is a more compact way to do it, but I don't know if it is also more efficient.
out = [{'user': user, 'frequency': data.loc[data['x'] == data['y']]['user'].value_counts()[user]} for user in data['user'].unique() if data.loc[data['x'] == data['y']]['user'].value_counts()[user] == data['user'].value_counts()[user]]
>>> out
[{'user': 'AB', 'frequency': 3}]